By: Lincoln Berbert
In today's competitive business environment, understanding customer behavior and preferences is crucial for success. Companies strive to identify meaningful customer segments in order to tailor their products, services, and marketing strategies effectively. This case study focuses on a dataset containing various customer-related features, aiming to discover the best possible customer segments using unsupervised learning techniques, such as dimensionality reduction and clustering. Surveys have shown that up to 80% of participants do business with companies that provide them with a more personalized experience, and that targeted ad campaigns bring in a 77% return on investment.
The dataset under analysis comes from a higher-end specialty goods business with retail locations, a shipping catalog, and an online web store. It comprises diverse customer features, including demographic, socio-economic, and behavioral data. This rich dataset allows us to examine how customers interact with the company and what their purchasing patterns are, providing valuable insights for targeted marketing campaigns and personalized customer experiences.
The main objective of this study is to leverage unsupervised learning methods, specifically dimensionality reduction and clustering techniques, to identify meaningful customer segments based on the given dataset. By uncovering these segments, the company can better understand its customers, enabling more effective targeting and resource allocation in future marketing efforts. We will then use these segments to provide campaign recommendations and actionable processes for future marketing campaigns.
Given the customer dataset with features such as year of birth, education, marital status, household composition, income, purchasing behavior, and campaign responses, we aim to:
The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
# Libraries for data manipulation and calculations
import pandas as pd
import numpy as np
# Libraries for Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style = 'darkgrid')
# Encoding variables
from sklearn.preprocessing import LabelEncoder
# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Data scaling
from sklearn.preprocessing import StandardScaler
# Calculating distances
from scipy.spatial.distance import cdist, pdist
# K-Means & silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# K-Medoids, Hierarchical clustering, DBSCAN, and Gaussian Mixture
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
# Additional computations and graphs
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Importing the dataset and creating a copy to not affect the original
marketdata = pd.read_csv("marketing_campaign.csv")
df = marketdata.copy()
# Checking the shape of the data
df.shape
(2240, 27)
The dataset has 2240 rows and 27 columns
# Checking a few random samples from the data
df.sample(10, random_state = 1)
| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 779 | 10736 | 1971 | Graduation | Single | 72258.0 | 0 | 1 | 12-09-2013 | 28 | 522 | ... | 9 | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 389 | 9799 | 1968 | PhD | Divorced | 83664.0 | 1 | 1 | 08-05-2013 | 57 | 866 | ... | 2 | 12 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 510 | 9925 | 1981 | PhD | Together | 39665.0 | 1 | 0 | 25-05-2013 | 97 | 127 | ... | 2 | 3 | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1553 | 7321 | 1962 | Graduation | Together | 76081.0 | 0 | 0 | 23-05-2014 | 85 | 292 | ... | 5 | 4 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1172 | 8278 | 1990 | PhD | Married | 74214.0 | 0 | 0 | 26-08-2012 | 3 | 863 | ... | 2 | 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1204 | 2995 | 1957 | Master | Together | 66636.0 | 0 | 0 | 17-08-2013 | 64 | 291 | ... | 4 | 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2057 | 1071 | 1976 | PhD | Divorced | 70179.0 | 0 | 1 | 21-07-2013 | 10 | 532 | ... | 3 | 13 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1960 | 8925 | 1965 | Master | Married | 70053.0 | 0 | 1 | 03-07-2013 | 38 | 512 | ... | 5 | 10 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 615 | 3083 | 1974 | Graduation | Married | 45837.0 | 1 | 1 | 26-07-2013 | 88 | 215 | ... | 2 | 5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 761 | 6887 | 1967 | Graduation | Single | 79146.0 | 1 | 1 | 24-04-2014 | 33 | 245 | ... | 1 | 8 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 rows × 27 columns
At first glance:
# Checking the data types and missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Response             2240 non-null   int64
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
# Checking for duplicates
df.duplicated().sum()
0
No duplicates are present.
# Imputing the missing values using the median, as it is less sensitive to outliers than the mean
median_income = df['Income'].median()
df['Income'].fillna(median_income, inplace = True)
# Checking that the values have been properly imputed
df['Income'].isnull().sum()
0
There are now no missing values in the income column.
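The comment above claims the median is less sensitive to outliers than the mean; a quick sketch with hypothetical income values illustrates why:

```python
import pandas as pd

# Hypothetical incomes with one extreme outlier
incomes = pd.Series([30000, 40000, 50000, 60000, 666666])

print(incomes.mean())    # 169333.2 -- dragged far upward by the single outlier
print(incomes.median())  # 50000.0  -- unaffected by the outlier's magnitude
```

Imputing with the mean here would have filled the missing incomes with a value inflated well above what a typical customer earns.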
df['ID'].nunique()
2240
# Removing the ID column from the data
df.drop('ID', axis = 1, inplace = True)
The ID column is a unique identifier for each individual; as such, it provides no valuable information and can be dropped.
Questions:
# Checking the summary statistics of the data
df.describe(include = 'all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2240.0 | NaN | NaN | NaN | 1968.805804 | 11.984069 | 1893.0 | 1959.0 | 1970.0 | 1977.0 | 1996.0 |
| Education | 2240 | 5 | Graduation | 1127 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 2240 | 8 | Married | 864 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income | 2240.0 | NaN | NaN | NaN | 52237.975446 | 25037.955891 | 1730.0 | 35538.75 | 51381.5 | 68289.75 | 666666.0 |
| Kidhome | 2240.0 | NaN | NaN | NaN | 0.444196 | 0.538398 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Teenhome | 2240.0 | NaN | NaN | NaN | 0.50625 | 0.544538 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Dt_Customer | 2240 | 663 | 31-08-2012 | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Recency | 2240.0 | NaN | NaN | NaN | 49.109375 | 28.962453 | 0.0 | 24.0 | 49.0 | 74.0 | 99.0 |
| MntWines | 2240.0 | NaN | NaN | NaN | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | NaN | NaN | NaN | 26.302232 | 39.773434 | 0.0 | 1.0 | 8.0 | 33.0 | 199.0 |
| MntMeatProducts | 2240.0 | NaN | NaN | NaN | 166.95 | 225.715373 | 0.0 | 16.0 | 67.0 | 232.0 | 1725.0 |
| MntFishProducts | 2240.0 | NaN | NaN | NaN | 37.525446 | 54.628979 | 0.0 | 3.0 | 12.0 | 50.0 | 259.0 |
| MntSweetProducts | 2240.0 | NaN | NaN | NaN | 27.062946 | 41.280498 | 0.0 | 1.0 | 8.0 | 33.0 | 263.0 |
| MntGoldProds | 2240.0 | NaN | NaN | NaN | 44.021875 | 52.167439 | 0.0 | 9.0 | 24.0 | 56.0 | 362.0 |
| NumDealsPurchases | 2240.0 | NaN | NaN | NaN | 2.325 | 1.932238 | 0.0 | 1.0 | 2.0 | 3.0 | 15.0 |
| NumWebPurchases | 2240.0 | NaN | NaN | NaN | 4.084821 | 2.778714 | 0.0 | 2.0 | 4.0 | 6.0 | 27.0 |
| NumCatalogPurchases | 2240.0 | NaN | NaN | NaN | 2.662054 | 2.923101 | 0.0 | 0.0 | 2.0 | 4.0 | 28.0 |
| NumStorePurchases | 2240.0 | NaN | NaN | NaN | 5.790179 | 3.250958 | 0.0 | 3.0 | 5.0 | 8.0 | 13.0 |
| NumWebVisitsMonth | 2240.0 | NaN | NaN | NaN | 5.316518 | 2.426645 | 0.0 | 3.0 | 6.0 | 7.0 | 20.0 |
| AcceptedCmp3 | 2240.0 | NaN | NaN | NaN | 0.072768 | 0.259813 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp4 | 2240.0 | NaN | NaN | NaN | 0.074554 | 0.262728 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp5 | 2240.0 | NaN | NaN | NaN | 0.072768 | 0.259813 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp1 | 2240.0 | NaN | NaN | NaN | 0.064286 | 0.245316 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp2 | 2240.0 | NaN | NaN | NaN | 0.012946 | 0.113069 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Complain | 2240.0 | NaN | NaN | NaN | 0.009375 | 0.096391 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Response | 2240.0 | NaN | NaN | NaN | 0.149107 | 0.356274 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
# Gathering all the categorical features, including the encoded values of campaign responses and complaints
categorical_features = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome', 'AcceptedCmp1', 'AcceptedCmp2','AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response', 'Complain']
# Counting and listing the unique values in each categorical feature.
for feature in categorical_features:
print("Unique values in", feature + ":\n")
print(df[feature].value_counts())
print('\n' + '-' * 100 + '\n')
Unique values in Education:

Graduation    1127
PhD            486
Master         370
2n Cycle       203
Basic           54
Name: Education, dtype: int64
----------------------------------------------------------------------------------------------------
Unique values in Marital_Status:

Married     864
Together    580
Single      480
Divorced    232
Widow        77
Alone         3
Absurd        2
YOLO          2
Name: Marital_Status, dtype: int64
----------------------------------------------------------------------------------------------------
Unique values in Kidhome:

0    1293
1     899
2      48
Name: Kidhome, dtype: int64
----------------------------------------------------------------------------------------------------
Unique values in Teenhome:

0    1158
1    1030
2      52
Name: Teenhome, dtype: int64
----------------------------------------------------------------------------------------------------
Unique values in AcceptedCmp1:

0    2096
1     144
Name: AcceptedCmp1, dtype: int64
----------------------------------------------------------------------------------------------------
Unique values in AcceptedCmp2:

0    2211
1      29
Name: AcceptedCmp2, dtype: int64
----------------------------------------------------------------------------------------------------
Unique values in AcceptedCmp3:

0    2077
1     163
Name: AcceptedCmp3, dtype: int64
----------------------------------------------------------------------------------------------------
Unique values in AcceptedCmp4:

0    2073
1     167
Name: AcceptedCmp4, dtype: int64
----------------------------------------------------------------------------------------------------
Unique values in AcceptedCmp5:

0    2077
1     163
Name: AcceptedCmp5, dtype: int64
----------------------------------------------------------------------------------------------------
Unique values in Response:

0    1906
1     334
Name: Response, dtype: int64
----------------------------------------------------------------------------------------------------
Unique values in Complain:

0    2219
1      21
Name: Complain, dtype: int64
----------------------------------------------------------------------------------------------------
# Replacing the unusual values in Education and Marital Status
df["Education"].replace('2n Cycle', 'Master', inplace = True)
df["Marital_Status"].replace({'Alone': 'Single', 'Absurd': 'Single', 'YOLO': 'Single'}, inplace = True)
# Checking to make sure the values got replaced properly
print(df['Education'].value_counts())
print()
print(df['Marital_Status'].value_counts())
Graduation    1127
Master         573
PhD            486
Basic           54
Name: Education, dtype: int64

Married     864
Together    580
Single      487
Divorced    232
Widow        77
Name: Marital_Status, dtype: int64
Our data now has no missing values, and the categorical variables have clean, consistent levels. Next, we will perform univariate analysis to find outliers and other trends within the data.
Univariate analysis explores each variable in a dataset separately. It looks at the range of values as well as their central tendency, and it can be applied to both numerical and categorical variables.
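As a minimal sketch of the quantities a univariate summary captures (toy values, not drawn from the dataset):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 4, 10])

print(s.min(), s.max())   # range of values: 1 10
print(s.mean())           # central tendency (mean), pulled right by the 10
print(s.median())         # central tendency (median): 2.5
print(s.mode().iloc[0])   # most frequent value: 2
print(s.skew() > 0)       # True: the long right tail gives positive skew
```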
Leading Questions:
# Gathering all the numerical features
numerical_features = ['Year_Birth', 'Income', 'Recency', 'MntFishProducts', 'MntMeatProducts', 'MntFruits', 'MntSweetProducts', 'MntWines', 'MntGoldProds', 'NumDealsPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebPurchases', 'NumWebVisitsMonth']
# Creating a function to graph histograms and boxplots for each numerical variable
def plot_hist_boxplot(feature, data, hist_color='winter', box_color='violet'):
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 5), sharex=True, gridspec_kw = {'height_ratios': (0.25, 0.75)})
# Boxplot
sns.boxplot(data=data, x = feature, color=box_color, ax = ax1)
ax1.set_title(f'Plots of {feature}')
ax1.set_xlabel('')
# Histplot
sns.histplot(data=data, x=feature, kde=True, palette=hist_color, ax=ax2)
plt.tight_layout()
plt.show()
# Printing the graphs and the skew
for feature in numerical_features:
plot_hist_boxplot(feature, df)
print('Skew :', round(df[feature].skew(),2))
print('\n' + '-' * 120 + '\n' * 3)
Skew : -0.35 ------------------------------------------------------------------------------------------------------------------------
Skew : 6.8 ------------------------------------------------------------------------------------------------------------------------
Skew : -0.0 ------------------------------------------------------------------------------------------------------------------------
Skew : 1.92 ------------------------------------------------------------------------------------------------------------------------
Skew : 2.08 ------------------------------------------------------------------------------------------------------------------------
Skew : 2.1 ------------------------------------------------------------------------------------------------------------------------
Skew : 2.14 ------------------------------------------------------------------------------------------------------------------------
Skew : 1.18 ------------------------------------------------------------------------------------------------------------------------
Skew : 1.89 ------------------------------------------------------------------------------------------------------------------------
Skew : 2.42 ------------------------------------------------------------------------------------------------------------------------
Skew : 1.88 ------------------------------------------------------------------------------------------------------------------------
Skew : 0.7 ------------------------------------------------------------------------------------------------------------------------
Skew : 1.38 ------------------------------------------------------------------------------------------------------------------------
Skew : 0.21 ------------------------------------------------------------------------------------------------------------------------
# Function to update outlier Year_Birth values
def update_year_birth(year):
if year >= 1800 and year < 1901:
return year + 100
else:
return year
# Apply the function to the Year_Birth column
df['Year_Birth'] = df['Year_Birth'].apply(update_year_birth)
# Verify that the Year_Birth values have been updated
print(df['Year_Birth'].describe())
count    2240.000000
mean     1968.939732
std        11.740777
min      1940.000000
25%      1959.000000
50%      1970.000000
75%      1977.000000
max      2000.000000
Name: Year_Birth, dtype: float64
plot_hist_boxplot('Year_Birth', df)
# Calculate the interquartile range (IQR) for the Income variable
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
# Identify the upper bound for outliers
upper_bound = Q3 + 1.5 * IQR
# Find the number of outliers above the upper bound
outliers = df[df['Income'] > upper_bound]
print(f"Number of outliers above the upper bound: {len(outliers)}")
# Display the outliers
print("Outliers:")
print(outliers['Income'])
Number of outliers above the upper bound: 8
Outliers:
164     157243.0
617     162397.0
655     153924.0
687     160803.0
1300    157733.0
1653    157146.0
2132    156924.0
2233    666666.0
Name: Income, dtype: float64
These are some extreme outliers. Because there are only a few of them, we will drop them.
# Drop the outliers from the dataframe
df = df.drop(outliers.index)
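Dropping is reasonable here because only eight rows are affected. An alternative worth knowing about, sketched below with toy data, is to cap (winsorize) values at the IQR fence instead of deleting rows; this is a hypothetical variant, not what the analysis above does:

```python
import pandas as pd

def cap_upper_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values above Q3 + k*IQR to the fence instead of dropping rows."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    fence = q3 + k * (q3 - q1)
    return series.clip(upper=fence)

# Toy series: the extreme 1000 is pulled down to the fence (16.0); no rows are lost
s = pd.Series([10, 12, 11, 13, 1000])
capped = cap_upper_iqr(s)
print(float(capped.max()), len(capped))  # 16.0 5
```

Capping preserves sample size at the cost of distorting the tail, so it is most useful when outliers are numerous rather than a handful as here.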
# Custom function to add counts and percentages on top of the bars in a countplot
def add_counts_on_bars(ax):
total = len(df)
for p in ax.patches:
count = p.get_height()
percentage = 100 * count / total
ax.annotate(f'{count} ({percentage:.1f}%)', (p.get_x() + p.get_width() / 2, count), ha='center', va='bottom')
# Visualize the distribution of categorical features
for feature in categorical_features:
fig, ax = plt.subplots()
sns.countplot(x=feature, data=df, ax=ax)
plt.title(f'Distribution of {feature}')
add_counts_on_bars(ax) # Add the counts on top of the bars
plt.show()
# Calculate the correlation matrix for numerical features
corr_matrix = df[numerical_features].corr()
# Visualize the correlation matrix using a heatmap
plt.figure(figsize = (10,10))
sns.heatmap(corr_matrix, annot=True, vmin = -1, vmax = 1, cmap='PuOr', fmt='.2f')
plt.show()
# Define the feature pairs
feature_groups = [
[
('Education', 'MntFishProducts'),
('Education', 'MntMeatProducts'),
('Education', 'MntFruits'),
('Education', 'MntSweetProducts'),
('Education', 'MntWines'),
('Education', 'MntGoldProds'),
],
[
('Marital_Status', 'MntFishProducts'),
('Marital_Status', 'MntMeatProducts'),
('Marital_Status', 'MntFruits'),
('Marital_Status', 'MntSweetProducts'),
('Marital_Status', 'MntWines'),
('Marital_Status', 'MntGoldProds'),
],
[
('Education', 'NumDealsPurchases'),
('Education', 'NumCatalogPurchases'),
('Education', 'NumStorePurchases'),
('Education', 'NumWebPurchases'),
('Education', 'NumWebVisitsMonth'),
],
[
('Marital_Status', 'NumDealsPurchases'),
('Marital_Status', 'NumCatalogPurchases'),
('Marital_Status', 'NumStorePurchases'),
('Marital_Status', 'NumWebPurchases'),
('Marital_Status', 'NumWebVisitsMonth'),
],
]
# Define color schemes for each group
color_palettes = ['Blues', 'Greens', 'Oranges', 'Purples']
# Iterate through the feature groups and create the bar plots
for palette, group in zip(color_palettes, feature_groups):
n_rows = len(group)
fig, axes = plt.subplots(1, n_rows, figsize=(20, 4))
for idx, (x, y) in enumerate(group):
sns.barplot(x=x, y=y, data=df, ci=None, ax=axes[idx], palette=palette)
if x == 'Marital_Status':
axes[idx].set_xticklabels(axes[idx].get_xticklabels(), rotation=45)
plt.tight_layout(rect=[0, 0, 1, 0.95]) # Add space at the top
plt.show()
# Choose the demographic variables to analyze
demographic_features = [
('Education', 'Income'),
('Marital_Status', 'Income'),
('Kidhome', 'Income'),
('Teenhome', 'Income'),
]
n_cols = len(demographic_features)
fig, axes = plt.subplots(2, n_cols, figsize=(20, 8))
# Plot mean income
for idx, (x, y) in enumerate(demographic_features):
sns.barplot(x=x, y=y, data=df, ci=None, ax=axes[0, idx])
# Plot median income
for idx, (x, y) in enumerate(demographic_features):
sns.barplot(x=x, y=y, data=df, ci=None, estimator=np.median, ax=axes[1, idx])
plt.tight_layout()
plt.show()
# Set up a 2x2 grid of plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Plot Teenhome against Marital_Status
sns.barplot(x='Marital_Status', y='Teenhome', data=df, ci=None, ax=axes[0, 0])
axes[0, 0].set_ylabel('Average Number of Teenagers')
axes[0, 0].set_title('Average Number of Teenagers by Marital Status')
# Plot Teenhome against Education
sns.barplot(x='Education', y='Teenhome', data=df, ci=None, ax=axes[0, 1])
axes[0, 1].set_ylabel('Average Number of Teenagers')
axes[0, 1].set_title('Average Number of Teenagers by Education')
# Plot Kidhome against Marital_Status
sns.barplot(x='Marital_Status', y='Kidhome', data=df, ci=None, ax=axes[1, 0])
axes[1, 0].set_ylabel('Average Number of Small Children')
axes[1, 0].set_title('Average Number of Small Children by Marital Status')
# Plot Kidhome against Education
sns.barplot(x='Education', y='Kidhome', data=df, ci=None, ax=axes[1, 1])
axes[1, 1].set_ylabel('Average Number of Small Children')
axes[1, 1].set_title('Average Number of Small Children by Education')
# Adjust the layout
plt.tight_layout()
plt.show()
Think About It:
# Create a feature of total children (teens + kids) in the household
df['ChildTotal'] = df['Teenhome'] + df['Kidhome']
# Creating a feature for total members of each household
# Add 1 to the household size for married couples or couples living together
df['AdditionalAdult'] = df['Marital_Status'].apply(lambda x: 1 if x in ['Married', 'Together'] else 0)
# Calculate the household size by combining Teenhome, Kidhome, and any additional adults
df['HouseholdSize'] = df['Teenhome'] + df['Kidhome'] + 1 + df['AdditionalAdult']
# Drop the 'AdditionalAdult' column as it's no longer needed
df.drop('AdditionalAdult', axis=1, inplace=True)
# Creating a feature for total amount spent
df['TotalSpent'] = (
df['MntFishProducts'] +
df['MntMeatProducts'] +
df['MntFruits'] +
df['MntSweetProducts'] +
df['MntWines'] +
df['MntGoldProds']
)
# Converting year_birth to age assuming the data was gathered in 2016
df['Age'] = 2016 - df['Year_Birth']
# Calculate the total number of purchases of each customer
df['TotalPurchases'] = (
df['NumDealsPurchases'] +
df['NumCatalogPurchases'] +
df['NumStorePurchases'] +
df['NumWebPurchases']
)
# Calculate how many offers each customer has accepted
df['TotalOffersAccepted'] = (
df['AcceptedCmp1'] +
df['AcceptedCmp2'] +
df['AcceptedCmp3'] +
df['AcceptedCmp4'] +
df['AcceptedCmp5'] +
df['Response']
)
# Filter out individuals with TotalPurchases of 0
df = df[df['TotalPurchases'] > 0]
# Create a feature 'AmountPerPurchase'
df['AmountPerPurchase'] = df['TotalSpent'] / df['TotalPurchases']
df['TotalPurchases'].sort_values()
2214 1
2228 1
774 1
1328 2
9 2
..
1669 34
1252 34
412 35
432 39
21 43
Name: TotalPurchases, Length: 2230, dtype: int64
# Convert 'Dt_Customer' to a datetime object; the dates are in day-first (DD-MM-YYYY) format
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')
# Collection date
collection_date = pd.Timestamp('2016-01-01')
# Calculate the number of days the customer has been with the company
df['DaysWithCompany'] = (collection_date - df['Dt_Customer']).dt.days
df.head().T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Year_Birth | 1957 | 1954 | 1965 | 1984 | 1981 |
| Education | Graduation | Graduation | Graduation | Graduation | PhD |
| Marital_Status | Single | Single | Together | Together | Married |
| Income | 58138.0 | 46344.0 | 71613.0 | 26646.0 | 58293.0 |
| Kidhome | 0 | 1 | 0 | 1 | 1 |
| Teenhome | 0 | 1 | 0 | 0 | 0 |
| Dt_Customer | 2012-04-09 00:00:00 | 2014-08-03 00:00:00 | 2013-08-21 00:00:00 | 2014-10-02 00:00:00 | 2014-01-19 00:00:00 |
| Recency | 58 | 38 | 26 | 26 | 94 |
| MntWines | 635 | 11 | 426 | 11 | 173 |
| MntFruits | 88 | 1 | 49 | 4 | 43 |
| MntMeatProducts | 546 | 6 | 127 | 20 | 118 |
| MntFishProducts | 172 | 2 | 111 | 10 | 46 |
| MntSweetProducts | 88 | 1 | 21 | 3 | 27 |
| MntGoldProds | 88 | 6 | 42 | 5 | 15 |
| NumDealsPurchases | 3 | 2 | 1 | 2 | 5 |
| NumWebPurchases | 8 | 1 | 8 | 2 | 5 |
| NumCatalogPurchases | 10 | 1 | 2 | 0 | 3 |
| NumStorePurchases | 4 | 2 | 10 | 4 | 6 |
| NumWebVisitsMonth | 7 | 5 | 4 | 6 | 5 |
| AcceptedCmp3 | 0 | 0 | 0 | 0 | 0 |
| AcceptedCmp4 | 0 | 0 | 0 | 0 | 0 |
| AcceptedCmp5 | 0 | 0 | 0 | 0 | 0 |
| AcceptedCmp1 | 0 | 0 | 0 | 0 | 0 |
| AcceptedCmp2 | 0 | 0 | 0 | 0 | 0 |
| Complain | 0 | 0 | 0 | 0 | 0 |
| Response | 1 | 0 | 0 | 0 | 0 |
| ChildTotal | 0 | 2 | 0 | 1 | 1 |
| HouseholdSize | 1 | 3 | 2 | 3 | 3 |
| TotalSpent | 1617 | 27 | 776 | 53 | 422 |
| Age | 59 | 62 | 51 | 32 | 35 |
| TotalPurchases | 25 | 6 | 21 | 8 | 19 |
| TotalOffersAccepted | 1 | 0 | 0 | 0 | 0 |
| AmountPerPurchase | 64.68 | 4.5 | 36.952381 | 6.625 | 22.210526 |
| DaysWithCompany | 1362 | 516 | 863 | 456 | 712 |
All our new features appear to be populated correctly. This enables some new analysis, and it will also let us drop many of the original variables in the data preparation steps later on.
new_features = ['Age', 'TotalPurchases', 'TotalOffersAccepted', 'DaysWithCompany', 'TotalSpent', 'AmountPerPurchase']
# Printing the graphs and the skew
for feature in new_features:
plot_hist_boxplot(feature, df)
print('Skew :', round(df[feature].skew(),2))
print('\n' + '-' * 120 + '\n' * 3)
Skew : 0.08 ------------------------------------------------------------------------------------------------------------------------
Skew : 0.23 ------------------------------------------------------------------------------------------------------------------------
Skew : 2.42 ------------------------------------------------------------------------------------------------------------------------
Skew : 0.01 ------------------------------------------------------------------------------------------------------------------------
Skew : 0.86 ------------------------------------------------------------------------------------------------------------------------
Skew : 22.2 ------------------------------------------------------------------------------------------------------------------------
# Creating Countplot for HouseholdSize
plt.figure(figsize=(12, 6))
ax = sns.countplot(data=df, x='HouseholdSize')
# Annotate the bars with counts and percentages
for p in ax.patches:
height = p.get_height()
ax.annotate(f'{height} ({height/len(df) * 100:.2f}%)',
(p.get_x() + p.get_width() / 2, height),
ha='center', va='bottom')
plt.title('Countplot for HouseholdSize')
plt.show()
# Remove income outliers
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
df_filtered = df[(df['Income'] >= Q1 - 1.5 * IQR) & (df['Income'] <= Q3 + 1.5 * IQR)]
# Creating graphs for each of the different comparisons
feature_pairs = [
('Age', 'TotalPurchases', 'scatterplot'),
('DaysWithCompany', 'TotalPurchases', 'scatterplot'),
('Age', 'Income', 'scatterplot'),
('TotalSpent', 'Income', 'scatterplot'),
('TotalOffersAccepted', 'Age', 'swarmplot'),
('TotalOffersAccepted', 'DaysWithCompany', 'swarmplot'),
('HouseholdSize', 'Income', 'barplot'),
('HouseholdSize', 'AmountPerPurchase', 'barplot')
]
# Create subplots
n_rows, n_cols = 4, 2
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(16, 20))
axes = axes.flatten()
for idx, (x, y, plot_type) in enumerate(feature_pairs):
if plot_type == 'scatterplot':
# Create scatter plot
sns.scatterplot(data=df_filtered, x=x, y=y, alpha=0.5, ax=axes[idx])
elif plot_type == 'barplot':
# Create bar plot
sns.barplot(data=df_filtered, x=x, y=y, ci=None, ax=axes[idx])
elif plot_type == 'swarmplot':
# Create swarm plot
sns.swarmplot(data=df_filtered, x=x, y=y, ax=axes[idx])
axes[idx].set_title(f'{x} vs. {y}')
# Adjust the layout
plt.tight_layout()
plt.show()
What are the most important observations and insights from the data, based on the EDA and data preprocessing performed?
These observations already hint at some of the clusters we might find. We are likely to see a few groups defined by economic status and education, a few defined by relationship status, and some that mix the two. One hypothesis is that individuals with lower incomes and less education will purchase differently from those with higher incomes and more education, and that both groups will purchase differently depending on whether they are single or with a partner. That alone suggests roughly four clusters. We will soon see whether this hypothesis holds up, or whether other important clusters emerge from the dataset.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2230 entries, 0 to 2239
Data columns (total 34 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Year_Birth           2230 non-null   int64
 1   Education            2230 non-null   object
 2   Marital_Status       2230 non-null   object
 3   Income               2230 non-null   float64
 4   Kidhome              2230 non-null   int64
 5   Teenhome             2230 non-null   int64
 6   Dt_Customer          2230 non-null   datetime64[ns]
 7   Recency              2230 non-null   int64
 8   MntWines             2230 non-null   int64
 9   MntFruits            2230 non-null   int64
 10  MntMeatProducts      2230 non-null   int64
 11  MntFishProducts      2230 non-null   int64
 12  MntSweetProducts     2230 non-null   int64
 13  MntGoldProds         2230 non-null   int64
 14  NumDealsPurchases    2230 non-null   int64
 15  NumWebPurchases      2230 non-null   int64
 16  NumCatalogPurchases  2230 non-null   int64
 17  NumStorePurchases    2230 non-null   int64
 18  NumWebVisitsMonth    2230 non-null   int64
 19  AcceptedCmp3         2230 non-null   int64
 20  AcceptedCmp4         2230 non-null   int64
 21  AcceptedCmp5         2230 non-null   int64
 22  AcceptedCmp1         2230 non-null   int64
 23  AcceptedCmp2         2230 non-null   int64
 24  Complain             2230 non-null   int64
 25  Response             2230 non-null   int64
 26  ChildTotal           2230 non-null   int64
 27  HouseholdSize        2230 non-null   int64
 28  TotalSpent           2230 non-null   int64
 29  Age                  2230 non-null   int64
 30  TotalPurchases       2230 non-null   int64
 31  TotalOffersAccepted  2230 non-null   int64
 32  AmountPerPurchase    2230 non-null   float64
 33  DaysWithCompany      2230 non-null   int64
dtypes: datetime64[ns](1), float64(2), int64(29), object(2)
memory usage: 674.3+ KB
We are going to drop all the categorical variables, because the clustering algorithms we will use are distance-based and distances between category labels are not meaningful. We will also drop the demographic features so that the segments are driven by purchasing behavior.
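A toy illustration of the "no meaningful distance" point: if categories are label-encoded to integers (a hypothetical encoding, not this project's pipeline), Euclidean distance imposes an ordering the categories never had:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical label encoding of Marital_Status: Single=0, Married=1, Widow=2
points = np.array([[0.0], [1.0], [2.0]])
d = cdist(points, points)

# The encoding makes Single 'twice as far' from Widow as from Married,
# a relationship that does not exist in the underlying categories.
print(d[0, 1], d[0, 2])  # 1.0 2.0
```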
# List of features to drop
drop_features = ['Year_Birth', 'Dt_Customer', 'Education', 'Marital_Status', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response', 'Complain', 'Kidhome', 'Teenhome', 'ChildTotal', 'HouseholdSize', 'Age', 'Income']
# Drop features
df_model = df.drop(drop_features, axis=1)
df_model.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2230 entries, 0 to 2239
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Recency              2230 non-null   int64
 1   MntWines             2230 non-null   int64
 2   MntFruits            2230 non-null   int64
 3   MntMeatProducts      2230 non-null   int64
 4   MntFishProducts      2230 non-null   int64
 5   MntSweetProducts     2230 non-null   int64
 6   MntGoldProds         2230 non-null   int64
 7   NumDealsPurchases    2230 non-null   int64
 8   NumWebPurchases      2230 non-null   int64
 9   NumCatalogPurchases  2230 non-null   int64
 10  NumStorePurchases    2230 non-null   int64
 11  NumWebVisitsMonth    2230 non-null   int64
 12  TotalSpent           2230 non-null   int64
 13  TotalPurchases       2230 non-null   int64
 14  TotalOffersAccepted  2230 non-null   int64
 15  AmountPerPurchase    2230 non-null   float64
 16  DaysWithCompany      2230 non-null   int64
dtypes: float64(1), int64(16)
memory usage: 378.1 KB
df_model.head()
| | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | TotalSpent | TotalPurchases | TotalOffersAccepted | AmountPerPurchase | DaysWithCompany |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 1617 | 25 | 1 | 64.680000 | 1362 |
| 1 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 27 | 6 | 0 | 4.500000 | 516 |
| 2 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 776 | 21 | 0 | 36.952381 | 863 |
| 3 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 53 | 8 | 0 | 6.625000 | 456 |
| 4 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 422 | 19 | 0 | 22.210526 | 712 |
# Calculate the correlation matrix
corr_matrix = df_model.corr()
# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Remaining Features')
plt.show()
# Create a StandardScaler object
scaler = StandardScaler()
# Fit the scaler to the data and transform it
df_scaled = scaler.fit_transform(df_model)
# Convert the scaled data back to a DataFrame
df_scaled = pd.DataFrame(df_scaled, columns=df_model.columns)
df_scaled.head()
| | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | TotalSpent | TotalPurchases | TotalOffersAccepted | AmountPerPurchase | DaysWithCompany |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.306673 | 0.979391 | 1.546758 | 1.734463 | 2.456041 | 1.471693 | 0.838983 | 0.358982 | 1.406121 | 2.633344 | -0.560010 | 0.697903 | 1.681105 | 1.330408 | 0.619884 | 0.696772 | 1.977019 |
| 1 | -0.384115 | -0.873681 | -0.637898 | -0.726847 | -0.652363 | -0.633485 | -0.731893 | -0.169072 | -1.118556 | -0.586150 | -1.177627 | -0.134801 | -0.963121 | -1.165801 | -0.503971 | -0.639154 | -1.666952 |
| 2 | -0.798588 | 0.358730 | 0.567430 | -0.175331 | 1.340673 | -0.149536 | -0.042240 | -0.697126 | 1.406121 | -0.228428 | 1.292842 | -0.551153 | 0.282492 | 0.804890 | -0.503971 | 0.081251 | -0.172321 |
| 3 | -0.798588 | -0.873681 | -0.562565 | -0.663035 | -0.506085 | -0.585090 | -0.751050 | -0.169072 | -0.757888 | -0.943871 | -0.560010 | 0.281551 | -0.919882 | -0.903042 | -0.503971 | -0.591981 | -1.925389 |
| 4 | 1.550091 | -0.392595 | 0.416764 | -0.216353 | 0.152165 | -0.004351 | -0.559479 | 1.415090 | 0.324116 | 0.129293 | 0.057607 | -0.134801 | -0.306222 | 0.542132 | -0.503971 | -0.246001 | -0.822722 |
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=1)
df_tsne = tsne.fit_transform(df_scaled)
# Create a dataframe for the t-SNE components
df_tsne = pd.DataFrame(data=df_tsne, columns=['Component 1', 'Component 2'])
# Visualize the results
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Component 1', y='Component 2', data=df_tsne)
plt.title('t-SNE Visualization of the Dataset')
plt.show()
There don't seem to be any very distinct clusters here. We can test other perplexity values and see whether they provide more insight. There are some noticeable smaller clusters of overlapping points, but there are many of them and not much variation between them.
# Apply t-SNE with perplexity 10
tsne_10 = TSNE(n_components=2, perplexity=10, random_state=1)
df_tsne_10 = tsne_10.fit_transform(df_scaled)
df_tsne_10 = pd.DataFrame(data=df_tsne_10, columns=['Component 1', 'Component 2'])
# Apply t-SNE with perplexity 20
tsne_20 = TSNE(n_components=2, perplexity=20, random_state=1)
df_tsne_20 = tsne_20.fit_transform(df_scaled)
df_tsne_20 = pd.DataFrame(data=df_tsne_20, columns=['Component 1', 'Component 2'])
# Apply t-SNE with perplexity 40
tsne_40 = TSNE(n_components=2, perplexity=40, random_state=1)
df_tsne_40 = tsne_40.fit_transform(df_scaled)
df_tsne_40 = pd.DataFrame(data=df_tsne_40, columns=['Component 1', 'Component 2'])
# Apply t-SNE with perplexity 100
tsne_100 = TSNE(n_components=2, perplexity=100, random_state=1)
df_tsne_100 = tsne_100.fit_transform(df_scaled)
df_tsne_100 = pd.DataFrame(data=df_tsne_100, columns=['Component 1', 'Component 2'])
# Merge t-SNE dataframes with the original dataframe to bring back the demographic variable 'Marital_Status'
df_tsne_10['Index'] = df.index
df_tsne_20['Index'] = df.index
df_tsne_40['Index'] = df.index
df_tsne_100['Index'] = df.index
df_tsne_10_merged = df_tsne_10.merge(df[['Marital_Status']], left_on='Index', right_index=True)
df_tsne_20_merged = df_tsne_20.merge(df[['Marital_Status']], left_on='Index', right_index=True)
df_tsne_40_merged = df_tsne_40.merge(df[['Marital_Status']], left_on='Index', right_index=True)
df_tsne_100_merged = df_tsne_100.merge(df[['Marital_Status']], left_on='Index', right_index=True)
# Visualize the results
fig, axes = plt.subplots(1, 4, figsize=(24, 6))
sns.scatterplot(ax=axes[0], x='Component 1', y='Component 2', data=df_tsne_10_merged, hue = 'Marital_Status')
axes[0].set_title('t-SNE Visualization (Perplexity 10)')
sns.scatterplot(ax=axes[1], x='Component 1', y='Component 2', data=df_tsne_20_merged, hue = 'Marital_Status')
axes[1].set_title('t-SNE Visualization (Perplexity 20)')
sns.scatterplot(ax=axes[2], x='Component 1', y='Component 2', data=df_tsne_40_merged, hue = 'Marital_Status')
axes[2].set_title('t-SNE Visualization (Perplexity 40)')
sns.scatterplot(ax=axes[3], x='Component 1', y='Component 2', data=df_tsne_100_merged, hue = 'Marital_Status')
axes[3].set_title('t-SNE Visualization (Perplexity 100)')
plt.show()
Once again there is minimal to no clustering observable here, even after including a demographic variable in an attempt to surface patterns. None of the perplexity values give us noticeable clusters, so we will move on to PCA.
Applying PCA provides dimensionality reduction, but we lose some interpretability in the process. PCA may also reduce some of the noise in the data. We will test it here and decide whether to perform our clustering on the original data or on the principal components.
# Fit PCA on the scaled data
n = df_scaled.shape[1]
pca = PCA(n_components = n, random_state = 1)
pca.fit(df_scaled)
# Calculate the cumulative explained variance
explained_variance = np.cumsum(pca.explained_variance_ratio_)
# Plot the explained variance graph
plt.figure(figsize=(6, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Principal Components')
plt.grid(True)
plt.show()
Let's check the elbow graph and see if that gives us a better choice of components.
# Plot the elbow graph
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Explained Variance')
plt.title('Elbow Plot for Principal Components')
plt.grid(True)
plt.show()
# Create a DataFrame with PCA coefficients
pca_coefficients = pd.DataFrame(np.round(pca.components_[:3,:], 2), columns=df_scaled.columns, index=['PC1', 'PC2', 'PC3'])
# Transpose the DataFrame
pca_coefficients = pca_coefficients.T
# Display the table with colored cells
plt.figure(figsize=(4, 12))
sns.heatmap(pca_coefficients, annot=True, cmap="coolwarm_r", center=0, vmin=-1, vmax=1)
plt.title('PCA Coefficients for 3 Principal Components')
plt.show()
# Fit and transform PCA on the scaled data
df_pca = pd.DataFrame(pca.fit_transform(df_scaled))
# Reset df's index before concatenating: df kept its original (gapped) Int64Index
# after rows were dropped, while df_pca has a fresh RangeIndex, so aligning on
# index would scramble the rows
df_pca_withdem = pd.concat([df_pca, df.reset_index(drop=True)], axis=1)
# Plot the scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_pca_withdem, x=df_pca[0], y=df_pca[1], hue = 'Marital_Status')
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title('Scatterplot of the First Two Principal Components')
plt.show()
Given that we need five or more principal components to cover a significant share of the variance, and that no clusters are initially observable in the data, we should be skeptical of PCA's contribution to our analysis; adding it might reduce interpretability without providing added value.
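The component count needed for a target variance share can be read off the curve programmatically rather than by eye. A minimal sketch, using a hypothetical cumulative-variance array in place of `np.cumsum(pca.explained_variance_ratio_)`:

```python
import numpy as np

# Hypothetical cumulative explained-variance curve; in the notebook this
# would be np.cumsum(pca.explained_variance_ratio_)
explained_variance = np.array([0.30, 0.48, 0.60, 0.70, 0.78, 0.84, 0.89, 0.93, 0.96, 0.98])

threshold = 0.80
# Index of the first component at or above the threshold, plus one
n_components = int(np.argmax(explained_variance >= threshold)) + 1
print(n_components)  # 6 components needed to reach 80% of the variance
```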
We will test K-Means both with and without PCA before clustering; if the PCA version is less interpretable and doesn't provide meaningful analysis, we will drop PCA from the remaining methods.
We will create two K-Means models: one using the PCA dataset and one using the original scaled dataset. First, we will use the Sum of Squared Errors to find the optimal K value.
# Calculate the SSE for different K values
K_values = range(1, 11)
sse = []
for K in K_values:
    kmeans = KMeans(n_clusters=K, random_state=1)
    kmeans.fit(df_scaled)
    sse.append(kmeans.inertia_)
# Plot the elbow curve
plt.figure(figsize=(16, 8))
plt.plot(K_values, sse, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Sum of Squared Errors (SSE)')
plt.title('Elbow Method for Optimal K-value')
plt.show()
K_values = range(2, 11) # Trying K values from 2 to 10
silhouette_scores = []
for K in K_values:
    kmeans = KMeans(n_clusters=K, random_state=1)
    cluster_labels = kmeans.fit_predict(df_scaled)
    silhouette_avg = silhouette_score(df_scaled, cluster_labels)
    silhouette_scores.append(silhouette_avg)
    print(f"For K = {K}, the average silhouette_score is: {silhouette_avg}")
plt.figure(figsize=(10, 6))
plt.plot(K_values, silhouette_scores, 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score for Different K Values")
plt.show()
For K = 2, the average silhouette_score is: 0.35053865728907435
For K = 3, the average silhouette_score is: 0.27186197643097576
For K = 4, the average silhouette_score is: 0.25229910270415834
For K = 5, the average silhouette_score is: 0.23745618918030156
For K = 6, the average silhouette_score is: 0.2386255804586004
For K = 7, the average silhouette_score is: 0.2162783763062503
For K = 8, the average silhouette_score is: 0.1297028510069316
For K = 9, the average silhouette_score is: 0.13012755943470536
For K = 10, the average silhouette_score is: 0.12142398323171581
On both graphs K = 2 looks optimal; however, two clusters would not give us very meaningful segmentation. The next best value is K = 3: it shows a noticeable drop on the elbow plot and has the highest silhouette score among the remaining values. We will use K = 3 for our K-Means models.
# Create K-means models
kmeans_pca = KMeans(n_clusters=3, random_state=1)
kmeans_original = KMeans(n_clusters=3, random_state=1)
# Fit the models
kmeans_pca.fit(df_pca)
kmeans_original.fit(df_scaled)
# Get the cluster labels
labels_pca = kmeans_pca.labels_
labels_original = kmeans_original.labels_
# Create a new DataFrame with the cluster labels
df_labels_original = df_scaled.copy()
df_labels_original['Clusters'] = labels_original
df_labels_pca = df_pca.copy()
df_labels_pca['Clusters'] = labels_pca
# Checking to see if PCA is providing valuable clustering differences by comparing the value counts of each cluster
value_counts_original = df_labels_original['Clusters'].value_counts()
value_counts_pca = df_labels_pca['Clusters'].value_counts()
print("Value counts for original dataset clusters:")
print(value_counts_original)
print("\nValue counts for PCA transformed dataset clusters:")
print(value_counts_pca)
Value counts for original dataset clusters:
0    1060
2     607
1     563
Name: Clusters, dtype: int64

Value counts for PCA transformed dataset clusters:
0    1060
2     607
1     563
Name: Clusters, dtype: int64
It seems our PCA and original-dataset models produce the same cluster sizes. This suggests PCA is not capturing anything new, so to maintain interpretability we will not use PCA in our clustering analysis. From here on, all our models will use the original (scaled) data.
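Matching value counts are suggestive but don't by themselves prove that the two models assigned the same customers to the same groups. A label-permutation-invariant measure such as the adjusted Rand index makes the check explicit; a minimal sketch, with small hypothetical arrays standing in for labels_original and labels_pca:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical stand-ins for labels_original and labels_pca
labels_a = np.array([0, 0, 1, 1, 2, 2, 0, 1])
labels_b = np.array([2, 2, 0, 0, 1, 1, 2, 0])  # same partition, labels renamed

# ARI equals 1.0 exactly when the two partitions agree up to label renaming
print(adjusted_rand_score(labels_a, labels_b))  # 1.0
```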
# Copying the information to new dataframes
df1 = df.copy()
df_Kmeans = df_scaled.copy()
df1['KMeansClusters'] = labels_original
df_Kmeans['KMeansClusters'] = labels_original
kmeans_cluster_profile = df1.groupby("KMeansClusters").mean()
# Creating the ValueCount feature for observation purposes
kmeans_cluster_profile["ValueCount"] = (value_counts_original)
kmeans_cluster_profile = kmeans_cluster_profile.T
# Highlighting the maximum average value among all the clusters for each of the variables
kmeans_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
| KMeansClusters | 0 | 1 | 2 |
|---|---|---|---|
| Year_Birth | 1971.023585 | 1968.214920 | 1965.925865 |
| Income | 35357.330660 | 75916.195382 | 57680.852554 |
| Kidhome | 0.754717 | 0.035524 | 0.281713 |
| Teenhome | 0.475472 | 0.206039 | 0.843493 |
| Recency | 49.283019 | 50.376554 | 47.673806 |
| MntWines | 45.128302 | 633.801066 | 454.586491 |
| MntFruits | 5.065094 | 70.046181 | 23.186161 |
| MntMeatProducts | 23.672642 | 461.600355 | 138.413509 |
| MntFishProducts | 7.373585 | 102.397869 | 30.570016 |
| MntSweetProducts | 5.163208 | 71.344583 | 24.663921 |
| MntGoldProds | 15.287736 | 79.190053 | 62.253707 |
| NumDealsPurchases | 2.024528 | 1.300178 | 3.782537 |
| NumWebPurchases | 2.141509 | 5.213144 | 6.492586 |
| NumCatalogPurchases | 0.572642 | 6.023091 | 3.107084 |
| NumStorePurchases | 3.292453 | 8.328597 | 7.883031 |
| NumWebVisitsMonth | 6.372642 | 2.886323 | 5.752883 |
| AcceptedCmp3 | 0.067925 | 0.085258 | 0.070840 |
| AcceptedCmp4 | 0.015094 | 0.133215 | 0.125206 |
| AcceptedCmp5 | 0.000000 | 0.264654 | 0.023064 |
| AcceptedCmp1 | 0.000943 | 0.214920 | 0.036244 |
| AcceptedCmp2 | 0.001887 | 0.031972 | 0.014827 |
| Complain | 0.011321 | 0.007105 | 0.008237 |
| Response | 0.083962 | 0.291297 | 0.133443 |
| ChildTotal | 1.230189 | 0.241563 | 1.125206 |
| HouseholdSize | 2.881132 | 1.859680 | 2.782537 |
| TotalSpent | 101.690566 | 1418.380107 | 733.673806 |
| Age | 44.976415 | 47.785080 | 50.074135 |
| TotalPurchases | 8.031132 | 20.865009 | 21.265239 |
| TotalOffersAccepted | 0.169811 | 1.021314 | 0.403624 |
| AmountPerPurchase | 11.315050 | 73.519124 | 34.359878 |
| DaysWithCompany | 864.909434 | 914.646536 | 958.739703 |
| ValueCount | 1060.000000 | 563.000000 | 607.000000 |
df1.groupby(["KMeansClusters", "Marital_Status"])['Year_Birth'].count()
KMeansClusters Marital_Status
0 Divorced 103
Married 413
Single 239
Together 277
Widow 28
1 Divorced 54
Married 201
Single 135
Together 147
Widow 26
2 Divorced 73
Married 247
Single 112
Together 152
Widow 23
Name: Year_Birth, dtype: int64
df1.groupby(["KMeansClusters", "HouseholdSize"])['Year_Birth'].count()
KMeansClusters HouseholdSize
0 1 48
2 282
3 499
4 210
5 21
1 1 175
2 293
3 94
4 1
2 1 29
2 185
3 293
4 89
5 11
Name: Year_Birth, dtype: int64
df1.groupby(["KMeansClusters", "Education"])['Year_Birth'].count()
KMeansClusters Education
0 Basic 52
Graduation 519
Master 291
PhD 198
1 Graduation 321
Master 132
PhD 110
2 Basic 2
Graduation 282
Master 149
PhD 174
Name: Year_Birth, dtype: int64
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']
# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()
for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='KMeansClusters', y=feature, data=df1, ax=axes[i])
    axes[i].set_title(feature)
# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])
plt.tight_layout(pad = 3.0)
plt.show()
Cluster 0
This cluster seems to represent low-income individuals
Cluster 1
This cluster seems to represent high-income individuals
Cluster 2
This cluster seems to represent mid-income individuals
These clusters are something we could already predict: high-income individuals spend more overall and make more purchases. This is not very informative, so K-Means is unlikely to be a good model for what we want to achieve. Targeting higher-income individuals with advertisements isn't guaranteed to give us any gain, and doing so would cost us the ability to market potential products to low- and middle-income individuals.
Given that K = 3 gave us uninformative groupings with K-Means, and assuming K = 3 will give similar results with K-Medoids, we will also test K = 4 and K = 5 for possible new insights.
# Creating a new dataframe copy
Kmed_df = df_scaled.copy()
Kmed4_df = df_scaled.copy()
Kmed5_df = df_scaled.copy()
df2 = df.copy()
# Create K-Medoids models
Kmed = KMedoids(n_clusters=3, random_state=1)
Kmed4 = KMedoids(n_clusters=4, random_state=1)
Kmed5 = KMedoids(n_clusters=5, random_state=1)
# Fit the models
Kmed.fit(Kmed_df)
Kmed4.fit(Kmed4_df)
Kmed5.fit(Kmed5_df)
# Get the cluster labels
Kmed_labels = Kmed.labels_
Kmed4_labels = Kmed4.labels_
Kmed5_labels = Kmed5.labels_
# Create a new DataFrame with the cluster labels
Kmed_df['KMedClusters'] = Kmed_labels
Kmed4_df['KMedClusters'] = Kmed4_labels
Kmed5_df['KMedClusters'] = Kmed5_labels
df2['KMed3Clusters'] = Kmed_labels
df2['KMed4Clusters'] = Kmed4_labels
df2['KMed5Clusters'] = Kmed5_labels
Kmed_df['KMedClusters'].value_counts()
1    1153
0     543
2     534
Name: KMedClusters, dtype: int64
Kmed4_df['KMedClusters'].value_counts()
1    964
3    513
0    436
2    317
Name: KMedClusters, dtype: int64
Kmed5_df['KMedClusters'].value_counts()
1    632
2    578
0    481
3    276
4    263
Name: KMedClusters, dtype: int64
# Creating the cluster profiles for each of the models
kmed_cluster_profile = Kmed_df.groupby("KMedClusters").mean()
kmed4_cluster_profile = Kmed4_df.groupby("KMedClusters").mean()
kmed5_cluster_profile = Kmed5_df.groupby("KMedClusters").mean()
# Creating the ValueCount feature for observation purposes
kmed_cluster_profile["ValueCount"] = (Kmed_df['KMedClusters'].value_counts())
kmed_cluster_profile = kmed_cluster_profile.T
kmed4_cluster_profile["ValueCount"] = (Kmed4_df['KMedClusters'].value_counts())
kmed4_cluster_profile = kmed4_cluster_profile.T
kmed5_cluster_profile["ValueCount"] = (Kmed5_df['KMedClusters'].value_counts())
kmed5_cluster_profile = kmed5_cluster_profile.T
# Highlighting the maximum average value among all the clusters for each of the variables
kmed_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
| KMedClusters | 0 | 1 | 2 |
|---|---|---|---|
| Recency | -0.067600 | -0.002025 | 0.073111 |
| MntWines | 0.742206 | -0.737002 | 0.836602 |
| MntFruits | 0.005831 | -0.508554 | 1.092127 |
| MntMeatProducts | 0.027268 | -0.626263 | 1.324484 |
| MntFishProducts | 0.023566 | -0.529476 | 1.119270 |
| MntSweetProducts | 0.083972 | -0.517031 | 1.030973 |
| MntGoldProds | 0.242434 | -0.507259 | 0.848741 |
| NumDealsPurchases | 0.731440 | -0.088925 | -0.551763 |
| NumWebPurchases | 1.068036 | -0.626196 | 0.266031 |
| NumCatalogPurchases | 0.439582 | -0.701874 | 1.068480 |
| NumStorePurchases | 0.779298 | -0.715084 | 0.751559 |
| NumWebVisitsMonth | 0.210242 | 0.422742 | -1.126560 |
| TotalSpent | 0.454936 | -0.802633 | 1.270422 |
| TotalPurchases | 1.064020 | -0.812227 | 0.671789 |
| TotalOffersAccepted | 0.164547 | -0.308051 | 0.497817 |
| AmountPerPurchase | 0.112720 | -0.466889 | 0.893476 |
| DaysWithCompany | 0.289718 | -0.124515 | -0.025752 |
| ValueCount | 543.000000 | 1153.000000 | 534.000000 |
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']
# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()
for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='KMed3Clusters', y=feature, data=df2, ax=axes[i])
    axes[i].set_title(feature)
# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])
plt.tight_layout(pad = 3.0)
plt.show()
These clusters look quite similar to the ones from K-Means. There are slight differences in cluster sizes and a bit more overlap between the middle- and high-income clusters, but the insights are generally the same and uninteresting, so we will look at the K = 4 and K = 5 models instead.
# Highlighting the maximum average value among all the clusters for each of the variables
kmed4_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
| KMedClusters | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Recency | 0.208045 | -0.003716 | -0.457007 | 0.112565 |
| MntWines | 1.075523 | -0.805400 | 0.605775 | 0.225043 |
| MntFruits | 1.215305 | -0.550062 | 0.428329 | -0.263925 |
| MntMeatProducts | 1.509669 | -0.666378 | 0.395495 | -0.275242 |
| MntFishProducts | 1.264682 | -0.574956 | 0.482790 | -0.292763 |
| MntSweetProducts | 1.321236 | -0.543246 | 0.296476 | -0.285287 |
| MntGoldProds | 0.771055 | -0.580345 | 0.305125 | 0.246684 |
| NumDealsPurchases | -0.471855 | -0.237544 | -0.212382 | 0.978647 |
| NumWebPurchases | 0.471361 | -0.787445 | 0.591488 | 0.713610 |
| NumCatalogPurchases | 1.215585 | -0.769835 | 0.834580 | -0.102215 |
| NumStorePurchases | 0.810505 | -0.831339 | 0.973317 | 0.271907 |
| NumWebVisitsMonth | -0.920712 | 0.440490 | -0.775746 | 0.434132 |
| TotalSpent | 1.506375 | -0.870605 | 0.602682 | -0.016698 |
| TotalPurchases | 0.845570 | -0.982360 | 0.883221 | 0.581571 |
| TotalOffersAccepted | 0.764232 | -0.316273 | -0.043084 | -0.028578 |
| AmountPerPurchase | 0.982930 | -0.508595 | 0.296147 | -0.062672 |
| DaysWithCompany | 0.254744 | -0.199049 | -0.373541 | 0.388358 |
| ValueCount | 436.000000 | 964.000000 | 317.000000 | 513.000000 |
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']
# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()
for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='KMed4Clusters', y=feature, data=df2, ax=axes[i])
    axes[i].set_title(feature)
# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])
plt.tight_layout(pad = 3.0)
plt.show()
Cluster 0
This cluster seems to represent high income individuals who purchase a lot due to their high income.
Cluster 1
This cluster seems to represent low income individuals who don't purchase much due to their low income.
Cluster 2
This cluster seems to represent relatively high-income individuals who are perhaps less inclined to spend money than those in cluster 0. They still spend a lot but don't quite match the extreme expenditure of their peers.
Cluster 3
This cluster seems to represent middle-class individuals.
These clusters aren't much more meaningful than the K = 3 model. We really only added one more category, an upper-middle-class tier, and the segments still break down along income lines. Let's look at K = 5 and see if that is any better.
# Highlighting the maximum average value among all the clusters for each of the variables
kmed5_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
| KMedClusters | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Recency | 0.053264 | -0.341651 | -0.098179 | -0.104922 | 1.049466 |
| MntWines | 1.045730 | -0.834037 | 0.577155 | -0.509660 | -0.641878 |
| MntFruits | 1.231957 | -0.575915 | 0.026457 | -0.460847 | -0.443694 |
| MntMeatProducts | 1.433987 | -0.689200 | 0.066124 | -0.501474 | -0.585497 |
| MntFishProducts | 1.305966 | -0.594037 | -0.003192 | -0.450304 | -0.481404 |
| MntSweetProducts | 1.271725 | -0.573910 | -0.004226 | -0.465768 | -0.448646 |
| MntGoldProds | 0.746742 | -0.647717 | 0.369503 | -0.141357 | -0.472945 |
| NumDealsPurchases | -0.491833 | -0.398842 | 0.419279 | 1.028615 | -0.142970 |
| NumWebPurchases | 0.502576 | -0.881725 | 0.739072 | 0.060149 | -0.487730 |
| NumCatalogPurchases | 1.168247 | -0.830668 | 0.449881 | -0.520049 | -0.583430 |
| NumStorePurchases | 0.845358 | -0.901067 | 0.750022 | -0.499591 | -0.504823 |
| NumWebVisitsMonth | -0.932880 | 0.506196 | -0.186665 | 1.008658 | -0.158547 |
| TotalSpent | 1.461421 | -0.906373 | 0.380585 | -0.584142 | -0.718143 |
| TotalPurchases | 0.849412 | -1.108842 | 0.857851 | -0.125714 | -0.642281 |
| TotalOffersAccepted | 0.720353 | -0.306585 | -0.027596 | -0.104921 | -0.409960 |
| AmountPerPurchase | 0.945786 | -0.545009 | 0.148309 | -0.334234 | -0.395251 |
| DaysWithCompany | 0.124408 | -0.269283 | 0.117662 | 0.748832 | -0.624865 |
| ValueCount | 481.000000 | 632.000000 | 578.000000 | 276.000000 | 263.000000 |
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']
# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()
for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='KMed5Clusters', y=feature, data=df2, ax=axes[i])
    axes[i].set_title(feature)
# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])
plt.tight_layout(pad = 3.0)
plt.show()
Cluster 0
This cluster once again seems to represent high-income individuals. They spend a lot per purchase and buy many items, they were quite receptive to previous marketing campaigns, and they tend to do their shopping in bulk.
Cluster 1
This cluster seems to represent relatively new customers. They also don't have high income and are on the younger end of the age spectrum.
Cluster 2
This cluster seems to represent the middle class. They purchase often in store, and their purchasing behavior mirrors that of the high-income individuals, just with smaller amounts overall.
Cluster 3
This cluster seems to represent consistent, lower-income customers. They have been with the company for a while and value the products, but they purchase differently from the higher-income individuals.
Cluster 4
This cluster seems to represent customers who have churned. They tried the company and then haven't really returned.
This seems to be the most informative model so far. We get more behavioral analysis, even though the groupings still break down along income lines fairly consistently. This is now the model to beat for our analyses going forward.
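The judgment that the K = 5 segmentation is the most informative is qualitative; the K = 3, 4, and 5 solutions can also be compared numerically with silhouette scores. A minimal sketch on synthetic data, with KMeans standing in for KMedoids (which lives in scikit-learn-extra but exposes the same fit_predict API); in the notebook the inputs would be df_scaled and the Kmed, Kmed4, and Kmed5 labels:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for df_scaled
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# Higher silhouette = tighter, better-separated clusters
for k in (3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(f"K = {k}: silhouette = {silhouette_score(X, labels):.3f}")
```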
# Creating a copy of the scaled dataframe
hierarchical_df = df_scaled.copy()
# Define distance metrics and linkage methods
distance_metrics = ['euclidean', 'chebyshev', 'mahalanobis', 'cityblock']
linkage_methods = ['single', 'complete', 'average', 'weighted']
high_correlation = 0
high_distandLink = [0, 0]
# Calculate cophenetic correlations for each combination
for dist in distance_metrics:
    for link in linkage_methods:
        try:
            Z = linkage(hierarchical_df, metric=dist, method=link)
            c, coph_dists = cophenet(Z, pdist(hierarchical_df))
            print("Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dist.capitalize(), link, c))
            # Track the best combination inside the try block so a failed
            # metric/linkage pair cannot reuse a stale value of c
            if high_correlation < c:
                high_correlation = c
                high_distandLink[0] = dist
                high_distandLink[1] = link
        except ValueError:
            print(f"{dist}, {link}: ValueError occurred")
# Printing the highest combination
print('-'*100)
print(
    "Highest cophenetic correlation is {}, with {} distance and {} linkage.".format(
        high_correlation, high_distandLink[0].capitalize(), high_distandLink[1]
    )
)
Cophenetic correlation for Euclidean distance and single linkage is 0.8062977229650866.
Cophenetic correlation for Euclidean distance and complete linkage is 0.815553947067454.
Cophenetic correlation for Euclidean distance and average linkage is 0.8551976844583773.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.6861600812765698.
Cophenetic correlation for Chebyshev distance and single linkage is 0.6527214094382623.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6518836492152121.
Cophenetic correlation for Chebyshev distance and average linkage is 0.7693410640044095.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.5418169857754074.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.7981563287295148.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.6709942422069805.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.8203778973606309.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.7480302954812181.
Cophenetic correlation for Cityblock distance and single linkage is 0.8553682275733636.
Cophenetic correlation for Cityblock distance and complete linkage is 0.5786349982207821.
Cophenetic correlation for Cityblock distance and average linkage is 0.7909021181194902.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.7437982452873503.
----------------------------------------------------------------------------------------------------
Highest cophenetic correlation is 0.8553682275733636, with Cityblock distance and single linkage.
It seems Cityblock distance with single linkage gives the best cophenetic correlation; let's compare the different linkages under Cityblock distance. We can also try Euclidean distance, because it has high correlation values, one of which is almost as high as Cityblock with single linkage.
# List of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# Plotting Dendrogram for each linkage method
for i, method in enumerate(linkage_methods):
    Z = linkage(hierarchical_df, metric='cityblock', method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Cityblock Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(hierarchical_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction")
These clusters all look pretty bad: it seems we would get two sizeable clusters and then a few very small ones. Let's check the other distance metrics for good clusters before we decide. The cophenetic correlations are high, but these dendrograms do not appear to yield good clusters.
# List of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# Plotting Dendrogram for each linkage method
for i, method in enumerate(linkage_methods):
    Z = linkage(hierarchical_df, metric='euclidean', method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Euclidean Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(hierarchical_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction")
Ward linkage looks quite good here visually, and average linkage has a very high cophenetic correlation. Average linkage is a reasonable first choice to test; if it clusters poorly, we can switch to Ward linkage.
# List of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# Plotting Dendrogram for each linkage method
for i, method in enumerate(linkage_methods):
Z = linkage(hierarchical_df, metric='chebyshev', method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Chebyshev Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(hierarchical_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction")
Chebyshev doesn't seem to produce very good clusters; it looks similar to Cityblock but with worse correlation values.
# List of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# Plotting Dendrogram for each linkage method
for i, method in enumerate(linkage_methods):
Z = linkage(hierarchical_df, metric='mahalanobis', method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Mahalanobis Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(hierarchical_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction")
Like Chebyshev, Mahalanobis doesn't seem to produce very good clusters; it also resembles Cityblock but with worse correlation values.
Having reviewed all the metrics, Euclidean distance looks best visually. Let's first test average linkage, since it has the highest cophenetic correlation; if it clusters poorly we will try Ward linkage.
# Note: in scikit-learn >= 1.2 the "affinity" argument is renamed "metric"
HC = AgglomerativeClustering(n_clusters = 4, affinity = "euclidean", linkage = "average")
HC.fit(hierarchical_df)
AgglomerativeClustering(linkage='average', n_clusters=4)
# Creating a copy of the original data
df3 = df.copy()
# Adding hierarchical cluster labels to the dataframes
hierarchical_df["HC_segments"] = HC.labels_
df3["HC_segments"] = HC.labels_
# Creating the cluster profile
hc_cluster_profile = df3.groupby("HC_segments").mean()
# Creating the ValueCount feature for observation purposes
hc_cluster_profile["ValueCount"] = (df3['HC_segments'].value_counts())
hc_cluster_profile = hc_cluster_profile.T
# Highlighting the maximum average value among all the clusters for each of the variables
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
| HC_segments | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Year_Birth | 1968.922662 | 1966.500000 | 1979.000000 | 1978.000000 |
| Income | 51709.240108 | 44171.875000 | 2447.000000 | 51381.500000 |
| Kidhome | 0.444694 | 0.250000 | 1.000000 | 0.000000 |
| Teenhome | 0.507644 | 0.750000 | 0.000000 | 0.000000 |
| Recency | 49.156924 | 30.000000 | 42.000000 | 53.000000 |
| MntWines | 305.961781 | 27.000000 | 1.000000 | 32.000000 |
| MntFruits | 26.468076 | 2.750000 | 1.000000 | 2.000000 |
| MntMeatProducts | 164.392086 | 12.750000 | 1725.000000 | 1607.000000 |
| MntFishProducts | 37.768885 | 2.750000 | 1.000000 | 12.000000 |
| MntSweetProducts | 27.012140 | 132.750000 | 1.000000 | 4.000000 |
| MntGoldProds | 43.874550 | 244.250000 | 1.000000 | 22.000000 |
| NumDealsPurchases | 2.319694 | 0.000000 | 15.000000 | 0.000000 |
| NumWebPurchases | 4.066547 | 25.500000 | 0.000000 | 0.000000 |
| NumCatalogPurchases | 2.632644 | 0.250000 | 28.000000 | 0.000000 |
| NumStorePurchases | 5.828237 | 0.250000 | 0.000000 | 1.000000 |
| NumWebVisitsMonth | 5.336331 | 0.750000 | 1.000000 | 0.000000 |
| AcceptedCmp3 | 0.073291 | 0.000000 | 0.000000 | 0.000000 |
| AcceptedCmp4 | 0.074640 | 0.000000 | 0.000000 | 1.000000 |
| AcceptedCmp5 | 0.073291 | 0.000000 | 0.000000 | 0.000000 |
| AcceptedCmp1 | 0.064748 | 0.000000 | 0.000000 | 0.000000 |
| AcceptedCmp2 | 0.013040 | 0.000000 | 0.000000 | 0.000000 |
| Complain | 0.009442 | 0.000000 | 0.000000 | 0.000000 |
| Response | 0.150180 | 0.000000 | 0.000000 | 0.000000 |
| ChildTotal | 0.952338 | 1.000000 | 1.000000 | 0.000000 |
| HouseholdSize | 2.597122 | 2.250000 | 3.000000 | 2.000000 |
| TotalSpent | 605.477518 | 422.250000 | 1730.000000 | 1679.000000 |
| Age | 47.077338 | 49.500000 | 37.000000 | 38.000000 |
| TotalPurchases | 14.847122 | 26.000000 | 43.000000 | 1.000000 |
| TotalOffersAccepted | 0.449191 | 0.000000 | 0.000000 | 1.000000 |
| AmountPerPurchase | 32.579848 | 16.212963 | 40.232558 | 1679.000000 |
| DaysWithCompany | 902.942896 | 874.250000 | 944.000000 | 1119.000000 |
| ValueCount | 2224.000000 | 4.000000 | 1.000000 | 1.000000 |
df3.groupby(["HC_segments", "Marital_Status"])['Year_Birth'].count()
HC_segments Marital_Status
0 Divorced 230
Married 859
Single 483
Together 575
Widow 77
1 Married 1
Single 3
2 Married 1
3 Together 1
Name: Year_Birth, dtype: int64
This is horrible clustering: one cluster contains 2,224 individuals while the other three hold only 6 in total. Let's try Ward linkage instead; even though it has a lower cophenetic correlation, its dendrogram clusters well visually, with three noticeable main clusters, so we will use n_clusters = 3.
HC = AgglomerativeClustering(n_clusters = 3, affinity = "euclidean", linkage = "ward")
# Fit on the feature columns only (dropping the earlier HC_segments labels)
HC.fit(hierarchical_df.drop(columns="HC_segments"))
# Adding hierarchical cluster labels to the dataframes
hierarchical_df["HCClusters"] = HC.labels_
df3["HCClusters"] = HC.labels_
# Creating the cluster profile
hc_cluster_profile = df3.groupby("HCClusters").mean()
# Creating the ValueCount feature for observation purposes
hc_cluster_profile["ValueCount"] = (df3['HCClusters'].value_counts())
hc_cluster_profile = hc_cluster_profile.T
# Highlighting the maximum average value among all the clusters for each of the variables
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
| HCClusters | 0 | 1 | 2 |
|---|---|---|---|
| Year_Birth | 1967.497260 | 1971.684829 | 1966.200355 |
| Income | 72689.812329 | 33556.030449 | 54538.756206 |
| Kidhome | 0.050685 | 0.818376 | 0.333333 |
| Teenhome | 0.343836 | 0.460470 | 0.797872 |
| Recency | 49.253425 | 48.900641 | 49.315603 |
| MntWines | 572.979452 | 32.994658 | 410.358156 |
| MntFruits | 62.178082 | 3.979701 | 17.312057 |
| MntMeatProducts | 390.252055 | 18.795940 | 117.932624 |
| MntFishProducts | 89.641096 | 5.619658 | 23.624113 |
| MntSweetProducts | 62.949315 | 4.132479 | 19.131206 |
| MntGoldProds | 73.245205 | 13.430556 | 57.689716 |
| NumDealsPurchases | 1.616438 | 1.970085 | 3.812057 |
| NumWebPurchases | 5.345205 | 1.957265 | 6.049645 |
| NumCatalogPurchases | 5.349315 | 0.446581 | 2.767730 |
| NumStorePurchases | 8.661644 | 3.038462 | 6.732270 |
| NumWebVisitsMonth | 3.141096 | 6.540598 | 6.129433 |
| AcceptedCmp3 | 0.072603 | 0.075855 | 0.069149 |
| AcceptedCmp4 | 0.123288 | 0.009615 | 0.120567 |
| AcceptedCmp5 | 0.208219 | 0.000000 | 0.019504 |
| AcceptedCmp1 | 0.175342 | 0.000000 | 0.028369 |
| AcceptedCmp2 | 0.027397 | 0.002137 | 0.012411 |
| Complain | 0.009589 | 0.012821 | 0.003546 |
| Response | 0.238356 | 0.094017 | 0.127660 |
| ChildTotal | 0.394521 | 1.278846 | 1.131206 |
| HouseholdSize | 2.021918 | 2.923077 | 2.797872 |
| TotalSpent | 1251.245205 | 78.952991 | 646.047872 |
| Age | 48.502740 | 44.315171 | 49.799645 |
| TotalPurchases | 20.972603 | 7.412393 | 19.361702 |
| TotalOffersAccepted | 0.845205 | 0.181624 | 0.377660 |
| AmountPerPurchase | 64.101006 | 9.934937 | 32.178778 |
| DaysWithCompany | 909.200000 | 872.924145 | 944.914894 |
| HC_segments | 0.004110 | 0.000000 | 0.010638 |
| ValueCount | 730.000000 | 936.000000 | 564.000000 |
df3.groupby(["HCClusters", "Marital_Status"])['Year_Birth'].count()
HCClusters Marital_Status
0 Divorced 77
Married 261
Single 164
Together 197
Widow 31
1 Divorced 91
Married 365
Single 218
Together 238
Widow 24
2 Divorced 62
Married 235
Single 104
Together 141
Widow 22
Name: Year_Birth, dtype: int64
df3.groupby(["HCClusters", "Education"])['Year_Birth'].count()
HCClusters Education
0 Basic 1
Graduation 402
Master 173
PhD 154
1 Basic 51
Graduation 454
Master 252
PhD 179
2 Basic 2
Graduation 266
Master 147
PhD 149
Name: Year_Birth, dtype: int64
df3.groupby(["HCClusters", "HouseholdSize"])['Year_Birth'].count()
HCClusters HouseholdSize
0 1 188
2 353
3 174
4 15
1 1 38
2 238
3 440
4 198
5 22
2 1 26
2 169
3 272
4 87
5 10
Name: Year_Birth, dtype: int64
df3.groupby(["HCClusters", "TotalOffersAccepted"])['Year_Birth'].count()
HCClusters TotalOffersAccepted
0 0 418
1 149
2 75
3 43
4 36
5 9
1 0 799
1 104
2 33
2 0 404
1 117
2 34
3 8
4 1
Name: Year_Birth, dtype: int64
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']
# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()
for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='HCClusters', y=feature, data=df3, ax=axes[i])
    axes[i].set_title(feature)
# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])
plt.tight_layout(pad = 3.0)
plt.show()
Cluster 0
This cluster seems to represent high income individuals who make a lot of purchases.
Cluster 1
This cluster seems to represent low-income individuals; they also tend to have larger households, and gold makes up a larger share of their spending than it does for the other clusters.
Cluster 2
This cluster seems to represent middle-class individuals; proportionally speaking, they favor our high-end products (wine and gold) more than the other clusters do.
It seems that Hierarchical Clustering has given us the same breakdown as our K-Means and K-Medoids models. Our dataset leans heavily on money spent, so the clusters tend to follow low-, middle-, and high-income patterns, as these groups have differing amounts of money available to spend.
DBSCAN is a very powerful algorithm for finding high-density clusters, but the problem is determining the best set of hyperparameters to use with it. It has two hyperparameters, eps and min_samples.
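A common heuristic for choosing eps is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbour and look for the elbow in the curve. A hedged sketch on synthetic data (the notebook would use dbscan_df):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the scaled feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

min_samples = 5
# Distance to the min_samples-th neighbour
# (each query point counts as its own first neighbour)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# The "elbow" of this curve is a common starting value for eps
plt.plot(k_dist)
plt.xlabel('Points sorted by k-NN distance')
plt.ylabel(f'Distance to neighbour {min_samples}')
plt.title('k-distance plot for choosing eps')
plt.show()
```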
dbscan_df = df_scaled.copy()
# Define the parameter search space
eps_values = np.arange(0.5, 10, 0.5)
min_samples_values = range(2, 25)
# Initialize the best parameters and silhouette score
best_eps = None
best_min_samples = None
best_silhouette = -1
# Perform Grid Search
for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples).fit(dbscan_df)
        labels = dbscan.labels_
        # Calculate the silhouette score for the current configuration
        # (note: noise points, label -1, are treated as their own cluster here)
        if len(set(labels)) > 1:
            silhouette = silhouette_score(dbscan_df, labels, metric='euclidean')
        else:
            silhouette = -1
        # Update the best parameters if the current configuration has a better silhouette score
        if silhouette > best_silhouette:
            best_eps = eps
            best_min_samples = min_samples
            best_silhouette = silhouette
# Print the best parameters and silhouette score
print(f"Best eps: {best_eps}")
print(f"Best min_samples: {best_min_samples}")
print(f"Best silhouette score: {best_silhouette}")
Best eps: 7.5
Best min_samples: 2
Best silhouette score: 0.7965250624155766
# Applying DBSCAN with eps as 7.5 and min sample as 2
dbs = DBSCAN(eps = 7.5, min_samples = 2)
# Creating a copy of the original data
df4 = df.copy()
# Fit once and reuse the labels so both dataframes get identical clusters
db_labels = dbs.fit_predict(dbscan_df)
dbscan_df["DBClusters"] = db_labels
df4["DBClusters"] = db_labels
# Creating the cluster profile
db_cluster_profile = df4.groupby("DBClusters").mean()
# Creating the ValueCount feature for observation purposes
db_cluster_profile["ValueCount"] = (df4['DBClusters'].value_counts())
db_cluster_profile = db_cluster_profile.T
# Highlighting the maximum average value among all the clusters for each of the variables
db_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
| DBClusters | -1 | 0 |
|---|---|---|
| Year_Birth | 1978.500000 | 1968.918312 |
| Income | 26914.250000 | 51695.708034 |
| Kidhome | 0.500000 | 0.444345 |
| Teenhome | 0.000000 | 0.508079 |
| Recency | 47.500000 | 49.122531 |
| MntWines | 16.500000 | 305.460952 |
| MntFruits | 1.500000 | 26.425494 |
| MntMeatProducts | 1666.000000 | 164.119838 |
| MntFishProducts | 6.500000 | 37.706014 |
| MntSweetProducts | 2.500000 | 27.201975 |
| MntGoldProds | 11.500000 | 44.234291 |
| NumDealsPurchases | 7.500000 | 2.315530 |
| NumWebPurchases | 0.000000 | 4.105027 |
| NumCatalogPurchases | 14.000000 | 2.628366 |
| NumStorePurchases | 0.500000 | 5.818223 |
| NumWebVisitsMonth | 0.500000 | 5.328097 |
| AcceptedCmp3 | 0.000000 | 0.073160 |
| AcceptedCmp4 | 0.500000 | 0.074506 |
| AcceptedCmp5 | 0.000000 | 0.073160 |
| AcceptedCmp1 | 0.000000 | 0.064632 |
| AcceptedCmp2 | 0.000000 | 0.013016 |
| Complain | 0.000000 | 0.009425 |
| Response | 0.000000 | 0.149910 |
| ChildTotal | 0.500000 | 0.952424 |
| HouseholdSize | 2.500000 | 2.596499 |
| TotalSpent | 1704.500000 | 605.148564 |
| Age | 37.500000 | 47.081688 |
| TotalPurchases | 22.000000 | 14.867145 |
| TotalOffersAccepted | 0.500000 | 0.448384 |
| AmountPerPurchase | 859.616279 | 32.550464 |
| DaysWithCompany | 1031.500000 | 902.891382 |
| ValueCount | 2.000000 | 2228.000000 |
This is a horrible clustering. We have 2 clusters, one with 2 datapoints and another with 2,228. We will try increasing the minimum samples and searching for another model, but it seems DBSCAN doesn't perform well on our dataset.
# Define the parameter search space
eps_values = np.arange(0.5, 10, 0.5)
min_samples_values = np.arange(50, 250, 5)
# Initialize the best parameters and silhouette score
best_eps = None
best_min_samples = None
best_silhouette = -1
# Perform Grid Search on the feature columns only (dropping the previous labels)
dbscan_features = dbscan_df.drop(columns="DBClusters")
for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples).fit(dbscan_features)
        labels = dbscan.labels_
        # Calculate the silhouette score for the current configuration
        # (note: noise points, label -1, are treated as their own cluster here)
        if len(set(labels)) > 1:
            silhouette = silhouette_score(dbscan_features, labels, metric='euclidean')
        else:
            silhouette = -1
        # Update the best parameters if the current configuration has a better silhouette score
        if silhouette > best_silhouette:
            best_eps = eps
            best_min_samples = min_samples
            best_silhouette = silhouette
# Print the best parameters and silhouette score
print(f"Best eps: {best_eps}")
print(f"Best min_samples: {best_min_samples}")
print(f"Best silhouette score: {best_silhouette}")
Best eps: 9.0
Best min_samples: 50
Best silhouette score: 0.7967082671882659
# Applying DBSCAN with eps as 9 and min sample as 50
dbs = DBSCAN(eps = 9, min_samples = 50)
# Creating a copy of the original data
df4 = df.copy()
# Refit on the feature columns only (dropping the previous run's labels),
# and reuse the labels so both dataframes get identical clusters
db_labels = dbs.fit_predict(dbscan_df.drop(columns="DBClusters"))
dbscan_df["DBClusters"] = db_labels
df4["DBClusters"] = db_labels
# Creating the cluster profile
db_cluster_profile = df4.groupby("DBClusters").mean()
# Creating the ValueCount feature for observation purposes
db_cluster_profile["ValueCount"] = (df4['DBClusters'].value_counts())
db_cluster_profile = db_cluster_profile.T
# Highlighting the maximum average value among all the clusters for each of the variables
db_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
| DBClusters | -1 | 0 |
|---|---|---|
| Year_Birth | 1978.500000 | 1968.918312 |
| Income | 26914.250000 | 51695.708034 |
| Kidhome | 0.500000 | 0.444345 |
| Teenhome | 0.000000 | 0.508079 |
| Recency | 47.500000 | 49.122531 |
| MntWines | 16.500000 | 305.460952 |
| MntFruits | 1.500000 | 26.425494 |
| MntMeatProducts | 1666.000000 | 164.119838 |
| MntFishProducts | 6.500000 | 37.706014 |
| MntSweetProducts | 2.500000 | 27.201975 |
| MntGoldProds | 11.500000 | 44.234291 |
| NumDealsPurchases | 7.500000 | 2.315530 |
| NumWebPurchases | 0.000000 | 4.105027 |
| NumCatalogPurchases | 14.000000 | 2.628366 |
| NumStorePurchases | 0.500000 | 5.818223 |
| NumWebVisitsMonth | 0.500000 | 5.328097 |
| AcceptedCmp3 | 0.000000 | 0.073160 |
| AcceptedCmp4 | 0.500000 | 0.074506 |
| AcceptedCmp5 | 0.000000 | 0.073160 |
| AcceptedCmp1 | 0.000000 | 0.064632 |
| AcceptedCmp2 | 0.000000 | 0.013016 |
| Complain | 0.000000 | 0.009425 |
| Response | 0.000000 | 0.149910 |
| ChildTotal | 0.500000 | 0.952424 |
| HouseholdSize | 2.500000 | 2.596499 |
| TotalSpent | 1704.500000 | 605.148564 |
| Age | 37.500000 | 47.081688 |
| TotalPurchases | 22.000000 | 14.867145 |
| TotalOffersAccepted | 0.500000 | 0.448384 |
| AmountPerPurchase | 859.616279 | 32.550464 |
| DaysWithCompany | 1031.500000 | 902.891382 |
| ValueCount | 2.000000 | 2228.000000 |
This is still horrible. It seems DBSCAN does not perform well on this dataset.
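One likely reason the grid search kept landing on degenerate solutions is that the raw silhouette score rewards one giant cluster plus a couple of outliers. A hedged sketch of a guarded search that rejects configurations with too much noise or a single dominant cluster, shown on synthetic blobs (the score_dbscan helper is a hypothetical addition, not part of the notebook above):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: three clear blobs
X, _ = make_blobs(n_samples=600, centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=0.5, random_state=1)

def score_dbscan(X, eps, min_samples, max_noise_frac=0.1, max_dominant_frac=0.9):
    """Silhouette on non-noise points, rejecting degenerate solutions."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        return -1.0
    if (~mask).mean() > max_noise_frac:               # mostly noise
        return -1.0
    sizes = np.bincount(labels[mask])
    if sizes.max() / mask.sum() > max_dominant_frac:  # one giant cluster
        return -1.0
    return silhouette_score(X[mask], labels[mask])

best = max(
    ((eps, ms, score_dbscan(X, eps, ms))
     for eps in np.arange(0.2, 2.0, 0.2)
     for ms in range(3, 15)),
    key=lambda t: t[2],
)
print(f"eps={best[0]:.1f}, min_samples={best[1]}, guarded silhouette={best[2]:.3f}")
```

On our dataset such a guard would likely confirm that no DBSCAN configuration yields balanced clusters, rather than reporting a misleadingly high score.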
# Creating a copy of the original and scaled dataset
gmm_df = df_scaled.copy()
df5 = df.copy()
# Define the range of k values
k_values = range(2, 11)
# Store the silhouette scores for each k value
silhouette_scores = []
# Loop through the k values and fit GMM models
for k in k_values:
    gmm = GaussianMixture(n_components=k, random_state=1)
    gmm_labels = gmm.fit_predict(gmm_df)
    # Calculate the silhouette score for the current k value
    silhouette_avg = silhouette_score(gmm_df, gmm_labels)
    silhouette_scores.append(silhouette_avg)
# Plot the silhouette scores for different k values
plt.figure(figsize=(10, 6))
plt.plot(k_values, silhouette_scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for different k values of GMM')
plt.show()
This graph seems to follow our K-Means and K-Medoids silhouette graphs, and since we gained more meaningful insights from K = 5 there, we will try K = 5 here as well.
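Because GMM is a probabilistic model, the number of components can also be chosen with likelihood-based criteria such as BIC (lower is better), which complements the silhouette curve. A minimal sketch on synthetic data (the notebook would pass gmm_df):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled data: three Gaussian blobs
X, _ = make_blobs(n_samples=600, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=1.0, random_state=1)

# BIC trades off fit quality against model complexity; lower is better
bics = {k: GaussianMixture(n_components=k, random_state=1).fit(X).bic(X)
        for k in range(2, 8)}
best_k = min(bics, key=bics.get)
print(f"BIC prefers k={best_k}")
```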
# Applying Gaussian Mixture
gmm = GaussianMixture(n_components = 5, random_state = 1)
# Fit once and reuse the labels so both dataframes get identical clusters
gmm_labels = gmm.fit_predict(gmm_df)
# Adding labels to the datasets
gmm_df["GMMClusters"] = gmm_labels
df5["GMMClusters"] = gmm_labels
# Creating the cluster profile
gmm_cluster_profile = df5.groupby("GMMClusters").mean()
# Creating the ValueCount feature for observation purposes
gmm_cluster_profile["ValueCount"] = (df5['GMMClusters'].value_counts())
gmm_cluster_profile = gmm_cluster_profile.T
# Highlighting the maximum average value among all the clusters for each of the variables
gmm_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
| GMMClusters | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Year_Birth | 1970.942308 | 1972.072273 | 1966.938697 | 1966.994129 | 1967.077572 |
| Income | 76542.081731 | 32186.548620 | 60662.212644 | 73821.782779 | 49277.837268 |
| Kidhome | 0.086538 | 0.851511 | 0.191571 | 0.039139 | 0.445194 |
| Teenhome | 0.173077 | 0.444152 | 0.704981 | 0.301370 | 0.738617 |
| Recency | 46.317308 | 48.254928 | 45.628352 | 50.616438 | 50.973019 |
| MntWines | 858.692308 | 20.131406 | 614.210728 | 547.921722 | 228.801012 |
| MntFruits | 62.471154 | 3.072273 | 34.934866 | 66.745597 | 11.499157 |
| MntMeatProducts | 549.269231 | 12.904074 | 179.440613 | 415.455969 | 72.369309 |
| MntFishProducts | 85.836538 | 4.742444 | 46.839080 | 97.859100 | 15.607083 |
| MntSweetProducts | 60.826923 | 3.323259 | 42.291188 | 66.610568 | 11.264755 |
| MntGoldProds | 78.682692 | 10.272011 | 80.153257 | 75.606654 | 38.822934 |
| NumDealsPurchases | 1.076923 | 1.771353 | 3.609195 | 1.475538 | 3.403035 |
| NumWebPurchases | 4.182692 | 1.633377 | 6.796935 | 5.395303 | 4.952782 |
| NumCatalogPurchases | 4.951923 | 0.318003 | 4.724138 | 5.610568 | 1.731872 |
| NumStorePurchases | 6.442308 | 2.885677 | 8.609195 | 8.667319 | 5.770658 |
| NumWebVisitsMonth | 4.086538 | 6.392904 | 6.091954 | 2.757339 | 6.042159 |
| AcceptedCmp3 | 0.134615 | 0.070959 | 0.157088 | 0.046967 | 0.050590 |
| AcceptedCmp4 | 0.355769 | 0.003942 | 0.168582 | 0.068493 | 0.080944 |
| AcceptedCmp5 | 0.586538 | 0.000000 | 0.091954 | 0.150685 | 0.001686 |
| AcceptedCmp1 | 0.394231 | 0.000000 | 0.107280 | 0.129159 | 0.015177 |
| AcceptedCmp2 | 0.096154 | 0.002628 | 0.053640 | 0.003914 | 0.001686 |
| Complain | 0.009615 | 0.014455 | 0.007663 | 0.007828 | 0.005059 |
| Response | 0.576923 | 0.077530 | 0.252874 | 0.158513 | 0.114671 |
| ChildTotal | 0.259615 | 1.295664 | 0.896552 | 0.340509 | 1.183811 |
| HouseholdSize | 1.865385 | 2.940867 | 2.490421 | 1.990215 | 2.851602 |
| TotalSpent | 1695.778846 | 54.445466 | 997.869732 | 1270.199609 | 378.364250 |
| Age | 45.057692 | 43.927727 | 49.061303 | 49.005871 | 48.922428 |
| TotalPurchases | 16.653846 | 6.608410 | 23.739464 | 21.148728 | 15.858347 |
| TotalOffersAccepted | 2.144231 | 0.155059 | 0.831418 | 0.557730 | 0.264755 |
| AmountPerPurchase | 118.281469 | 7.973858 | 42.427881 | 61.289262 | 22.731530 |
| DaysWithCompany | 951.173077 | 849.611038 | 1029.685824 | 885.743640 | 922.202361 |
| ValueCount | 104.000000 | 761.000000 | 261.000000 | 511.000000 | 593.000000 |
This looks like a good clustering. We have sizeable clusters, with the maximum average values spread across different clusters and features. There seem to be 3 main clusters and 2 smaller but still recognizable clusters.
df5.groupby(["GMMClusters", "Marital_Status"])['Year_Birth'].count()
GMMClusters Marital_Status
0 Divorced 13
Married 39
Single 25
Together 24
Widow 3
1 Divorced 74
Married 297
Single 178
Together 194
Widow 18
2 Divorced 41
Married 94
Single 55
Together 61
Widow 10
3 Divorced 42
Married 197
Single 113
Together 135
Widow 24
4 Divorced 60
Married 234
Single 115
Together 162
Widow 22
Name: Year_Birth, dtype: int64
df5.groupby(["GMMClusters", "Education"])['Year_Birth'].count()
GMMClusters Education
0 Basic 1
Graduation 49
Master 23
PhD 31
1 Basic 45
Graduation 368
Master 210
PhD 138
2 Basic 3
Graduation 124
Master 57
PhD 77
3 Graduation 291
Master 124
PhD 96
4 Basic 5
Graduation 290
Master 158
PhD 140
Name: Year_Birth, dtype: int64
df5.groupby(["GMMClusters", "HouseholdSize"])['Year_Birth'].count()
GMMClusters HouseholdSize
0 1 36
2 50
3 15
4 2
5 1
1 1 27
2 192
3 359
4 165
5 18
2 1 35
2 95
3 102
4 26
5 3
3 1 129
2 261
3 118
4 3
4 1 25
2 162
3 292
4 104
5 10
Name: Year_Birth, dtype: int64
df5.groupby(["GMMClusters", "TotalOffersAccepted"])['Year_Birth'].count()
GMMClusters TotalOffersAccepted
0 0 23
1 20
2 16
3 15
4 24
5 6
1 0 667
1 70
2 24
2 0 145
1 58
2 32
3 12
4 11
5 3
3 0 322
1 118
2 48
3 21
4 2
4 0 464
1 104
2 22
3 3
Name: Year_Birth, dtype: int64
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']
# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()
for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='GMMClusters', y=feature, data=df5, ax=axes[i])
    axes[i].set_title(feature)
# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])
plt.tight_layout(pad = 3.0)
plt.show()
We seem to have slightly more variability in these clusters. They potentially tell us a bit more about the patterns and different types of customers; however, they still follow income pretty consistently.
Cluster 0
This cluster seems to represent our high income whales who like luxury products. They don't make many purchases but make sizeable purchases of expensive wines and meat when they do.
Recommendation: Their spending pattern makes them an attractive target for marketing campaigns focusing on high-margin products and personalized promotions.
Cluster 1
This cluster seems to represent new customers, or customers who may not be able to afford our products but enjoy them, purchasing occasionally when they find something they can afford.
Recommendation: To increase their engagement and spending, the company could consider offering affordable or entry-level products, personalized promotions, or targeted online advertising to appeal to their preferences and financial capabilities.
Cluster 2
This cluster seems to represent our passionate and loyal customers. They love our store and shop here a lot; however, they don't have as much disposable income as our whales or higher-income individuals.
Recommendation: The company could benefit from offering targeted promotions and deals on gold and wine products to this group, as well as continuing to provide a seamless online shopping experience. Maintaining strong relationships with these loyal customers and nurturing their long-term engagement with the company will be crucial for retaining their business and maximizing their lifetime value.
Cluster 3
This cluster seems to represent our most diverse buyers. They purchase across all categories and make a lot of purchases with high amounts spent.
Recommendation: The company should try to market bundles to these customers. Given that they are relatively new to the company, we should also focus on making their experience as streamlined and enjoyable as possible, building rapport that encourages long-term engagement and maximizes their lifetime value.
Cluster 4
This cluster seems to represent indifferent buyers. They show up periodically to pick up a few goods, but they don't have a strong affiliation with our brand and don't currently spend much.
Recommendation: To cater to this group, the company could focus on enhancing their online shopping experience and offering attractive deals, especially for wines and gold products. It's essential to consider their lower-middle-class income and design promotions that appeal to their budget while still providing good value. Since they make an average number of purchases, targeted marketing and personalized recommendations could encourage them to shop more frequently and increase their overall spending with the company.
1. Comparison of various techniques and their relative performance based on silhouette score (Measure of success):
# Calculating K-Means Silhouette score at K = 3
kmeans = KMeans(n_clusters = 3, random_state = 1)
prediction = kmeans.fit_predict((df_scaled))
s_score = silhouette_score(df_scaled, prediction)
print(s_score)
0.27186197643097576
# Calculating K-Medoids Silhouette score at K = 5
kmedoids = KMedoids(n_clusters = 5, random_state = 1)
prediction = kmedoids.fit_predict((df_scaled))
s_score = silhouette_score(df_scaled, prediction)
print(s_score)
0.1077092526484127
# Calculating Agglomerative Clustering Silhouette score with Euclidean distance and ward linkage with Clusters = 3
HCm = AgglomerativeClustering(n_clusters = 3, affinity = 'euclidean', linkage = 'ward')
prediction = HCm.fit_predict((df_scaled))
s_score = silhouette_score(df_scaled, prediction)
print(s_score)
0.21913220436763914
# Calculating Gaussian Mixture Silhouette score at K = 5
GMM = GaussianMixture(n_components = 5, random_state = 1)
prediction = GMM.fit_predict((df_scaled))
s_score = silhouette_score(df_scaled, prediction)
print(s_score)
0.14694315739206412
In terms of silhouette score, K-Means with K = 3 provides the best result; however, it didn't give us any particularly informative clusters. All of our silhouette scores are quite low, which means our dataset doesn't cluster especially well, so we will choose the clustering technique that provided the most useful insights rather than the one with the highest silhouette score.
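Silhouette is only one internal validity index; Davies-Bouldin (lower is better) and Calinski-Harabasz (higher is better) can be used to corroborate a choice. A minimal sketch on synthetic data (the notebook would pass df_scaled and each model's labels):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Synthetic stand-in for df_scaled: four clear blobs
X, _ = make_blobs(n_samples=600, centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=1.0, random_state=1)

labels = KMeans(n_clusters=4, random_state=1, n_init=10).fit_predict(X)

# Three complementary internal indices for the same labelling
sil = silhouette_score(X, labels)
db = davies_bouldin_score(X, labels)
ch = calinski_harabasz_score(X, labels)
print(f"silhouette:        {sil:.3f} (higher is better)")
print(f"davies-bouldin:    {db:.3f} (lower is better)")
print(f"calinski-harabasz: {ch:.1f} (higher is better)")
```

When the indices disagree, that itself is a hint that the data lack strong cluster structure, as we observed here.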
Note: We will also take insights that we gained from the initial exploratory data analysis and the less successful models as there are still some interesting patterns and recommendations to be gained there.
2. Refined insights:
Web visits negatively correlate with purchases across all product types and channels, and lower-income individuals tend to use the web the most. Perhaps the marketing campaign should put more emphasis on web advertisements, specifically for deals or affordable products; this could convince lower-income individuals to make more purchases with our business.
Meat and wines correlate with one another, are our highest-earning products, and are favored by higher-income individuals. We should advertise them more heavily in the catalogs, where high-income individuals like to order and spend large amounts, perhaps putting them front and center. We can also move these products to the front of the store to make them more noticeable to anyone who enters.
Gold is typically seen as a high-status product category, yet proportionally it is purchased less than wines and meats. Our data science team should investigate whether we could profit by bundling or advertising gold to high-income individuals or particular clusters, or whether there is some important reason it isn't selling as well as we might expect. We need more insight into this product.
Our dataset is heavily monetary-based. If we gather data on product unit sales rather than only amount spent, we could perhaps determine more informative clusters and customer segments.
3. Proposal for the final solution design:
The model that gives us the best insights is our Gaussian Mixture model with 5 components (K = 5). Here we see 3 main clusters and 2 secondary clusters. For our marketing campaign we will utilize the 5 segments from this model to create recommendations. We will also take into account information from general demographic observations from the overall case study to make recommendations to the data science team and the data collections team.
The 5 clusters from the GMM model are as follows:
In the report below we will provide detailed recommendations for our marketing team, our data science team, and our data collections team on how to proceed regarding these clusters and the data we've gathered as well as what future data to gather.
This project proposes a Gaussian Mixture Model (GMM) for customer segmentation in a retail context, focusing on understanding demographics, product preferences, and purchasing channels to inform marketing strategies and data-driven decisions. Our analysis identified five distinct customer segments (three primary segments and two secondary), providing valuable insights to target specific customer groups and optimize marketing efforts across various channels.
Based on the segmentation, we recommend that stakeholders focus on web advertisements and affordable products for lower-income customers, emphasize high-margin products like meats and wines in catalogs and stores for high-income customers, and investigate the purchasing habits surrounding the gold product category. In addition, we advise refining and expanding the GMM model and investing in new data collection (such as number of products sold per consumer in each category) to further enhance accuracy and usefulness. By tailoring marketing campaigns to the identified customer segments, the company can better serve its diverse customer base and maximize profits.
The retail industry faces challenges in understanding and catering to diverse customer preferences, purchasing habits, and demographics. As competition grows, retailers need to develop targeted marketing campaigns to attract and retain customers effectively. The key objective of this project is to analyze customer data and identify distinct customer segments, enabling the retail company to better understand its clientele and create tailored marketing strategies to enhance customer engagement and increase revenue.
Our analysis focuses on customer demographics, product preferences, and purchasing channels, exploring various clustering methods to identify the most effective model for segmenting the customer base. These insights will enable the company to optimize marketing efforts across various channels, targeting specific customer groups based on their unique characteristics. The resulting segmentation and recommendations aim to support the company in refining its marketing campaigns, ultimately leading to improved customer satisfaction and increased profits.
To address the challenge of identifying distinct customer segments, various clustering techniques were explored, including K-means, K-medoids, Hierarchical Clustering, DBSCAN, and the Gaussian Mixture Model (GMM). The process involved preprocessing the data, scaling the features, and reducing dimensionality with PCA. The optimal number of clusters was determined using the silhouette score and other evaluation metrics, balancing the diversity of the clusters against the model's performance.
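The scale-reduce-score pipeline described above can be sketched as follows. This is a minimal illustration on synthetic data (the blob centers, feature count, and 90% variance threshold are assumptions for demonstration, not values from the case-study dataset), using K-means in the silhouette sweep as one of the candidate methods:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the preprocessed customer features:
# three well-separated groups in a 6-dimensional feature space.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 6)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 6)),
    rng.normal(loc=-5.0, scale=1.0, size=(100, 6)),
])

# Scale features so no single feature dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Reduce dimensionality with PCA, keeping enough components
# to explain roughly 90% of the variance (illustrative threshold).
pca = PCA(n_components=0.90, random_state=42)
X_pca = pca.fit_transform(X_scaled)

# Sweep candidate cluster counts and score each with the silhouette metric.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, labels)

best_k = max(scores, key=scores.get)
```

On real customer data the silhouette curve is rarely this clean, which is why the study also weighed cluster diversity and interpretability alongside the raw score.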
The final proposed solution is the Gaussian Mixture Model with five components, as it provided the most meaningful and diverse customer segments. These clusters enabled us to analyze customer demographics, product preferences, and purchasing channels more effectively. Our analysis revealed insights into customer behavior that could be leveraged to create tailored marketing strategies and enhance customer engagement.
Although the GMM's silhouette score of 0.15 is lower than the best K-Means score of 0.27, the GMM remains the preferred model: it allows greater flexibility in cluster shape and size than the other techniques, and it supports a probability-based approach to segmentation that yields additional insight into customer behavior. This solution design directly impacts the retail business by enabling the company to optimize its marketing efforts, target specific customer segments, and ultimately improve customer satisfaction and increase revenue. By understanding and addressing the unique needs of each customer group, the retailer can make informed decisions that cater to customer preferences and shopping habits, resulting in a more effective marketing strategy.
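The probability-based assignment that motivates the GMM choice can be sketched as below. This is an illustration on synthetic data (the blob layout and the 0.8 confidence threshold are assumptions, not parameters from the fitted case-study model), showing how soft assignments flag customers whose segment membership is ambiguous:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the reduced customer features:
# five well-separated groups in a 4-dimensional space.
rng = np.random.default_rng(0)
centers = (-12.0, -6.0, 0.0, 6.0, 12.0)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(120, 4)) for m in centers])

# Fit a 5-component GMM; covariance_type="full" lets each cluster
# take its own elliptical shape and size, unlike K-means.
gmm = GaussianMixture(n_components=5, covariance_type="full",
                      random_state=0).fit(X)

# Hard labels for reporting a single segment per customer.
labels = gmm.predict(X)

# Soft assignments: each row is a probability distribution over the 5 segments.
probs = gmm.predict_proba(X)

# Flag customers whose best segment probability falls below a
# confidence threshold (0.8 here is an illustrative cutoff).
ambiguous = probs.max(axis=1) < 0.8
```

Customers flagged as ambiguous could receive blended messaging from their top two segments rather than a single segment's campaign, one practical payoff of the probabilistic approach.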
The 5 clusters from the GMM model are as follows:
1. Targeted marketing for each cluster:
2. Key actionables for stakeholders:
3. Expected benefits and costs:
4. Key risks and challenges:
By implementing these marketing campaign recommendations, the retail business can expect to see improved customer satisfaction, increased revenue, and stronger customer engagement. However, stakeholders must remain vigilant to potential risks and challenges, adapting their strategies as needed to maintain the effectiveness of their marketing efforts.
We must remain aware that customer preferences may evolve over time and that customers may not respond as expected to the tailored marketing campaigns. Additionally, the probabilistic nature of GMM clustering can make it harder to target the right customer segments with certainty. To mitigate these risks, the company should continually monitor customer behavior, update the customer segmentation, and adjust marketing strategies accordingly.
1. Further analysis on the Gaussian Mixture Model (GMM):
2. Expand data collection to include new features:
3. Investigate the gold product category further:
4. Monitor changes in customer behavior over time:
5. Assess the impact of external factors on customer behavior:
By addressing these recommendations, the data science team can refine the current clustering model, gain deeper insights into customer behavior, and identify new opportunities for growth. This will also help the retail business to adapt its marketing strategies to account for changes in customer preferences and external factors, ensuring long-term success.